Generation and use of multiple speech processing transforms

US 9,218,806 B1
Filed: 05/10/2013
Issued: 12/22/2015
Est. Priority Date: 05/10/2013
Status: Active Grant

Abstract

Features are disclosed for selecting and using multiple transforms associated with a particular remote device for use in automatic speech recognition (“ASR”). Each transform may be based on statistics that have been generated from processing utterances that share some characteristic (e.g., acoustic characteristics, time frame within which the utterances were processed, etc.). When an utterance is received from the remote device, a particular transform or set of transforms may be selected for use in speech processing based on data obtained from the remote device, speech processing of a portion of the utterance, speech processing of prior utterances, etc. The transform or transforms used in processing the utterances may then be updated based on the results of the speech processing.
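
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes each transform is an affine map (a matrix `A` and a bias `b`) applied to every feature vector, and it uses a single diagonal Gaussian as a stand-in for the acoustic model. Neither detail comes from the record above, and the names `apply_transform`, `avg_log_likelihood`, and `select_transform` are hypothetical.

```python
import math

def apply_transform(transform, vectors):
    """Apply an affine transform (A, b) to each feature vector: y = A x + b."""
    A, b = transform
    return [
        [sum(a * x for a, x in zip(row, vec)) + bias for row, bias in zip(A, b)]
        for vec in vectors
    ]

def avg_log_likelihood(vectors, mean, var):
    """Score vectors against a toy diagonal-Gaussian stand-in for an acoustic model.

    Simplification: the Jacobian term (log |det A|) that a full likelihood
    comparison of affine feature transforms would include is omitted.
    """
    total = 0.0
    for vec in vectors:
        for x, m, v in zip(vec, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(vectors)

def select_transform(transforms, vectors, mean, var):
    """Return the name of the best-scoring transform and all scores."""
    scores = {
        name: avg_log_likelihood(apply_transform(t, vectors), mean, var)
        for name, t in transforms.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical data: a default (identity) transform and one device transform.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
shifted = ([[1.0, 0.0], [0.0, 1.0]], [-2.0, -2.0])
features = [[2.1, 1.9], [1.8, 2.2], [2.0, 2.0]]

best, scores = select_transform(
    {"default": identity, "device_a": shifted},
    features,
    mean=[0.0, 0.0],
    var=[1.0, 1.0],
)
print(best)  # device_a
```

Here the device transform wins because it moves the features closer to the model mean. A real system would score against a full acoustic model rather than one Gaussian.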

25 Claims

1. A system comprising:
- a computer-readable memory storing executable instructions; and
  
  one or more processors in communication with the computer-readable memory, wherein the one or more processors are programmed by the executable instructions to at least:
  
  receive, from a client device, audio data regarding a user utterance;
  
  obtain, based at least partly on the audio data, at least a first transform and a second transform of a plurality of feature vector transforms associated with the client device;
  
  generate a plurality of feature vectors based at least partly on the audio data;
  
  apply the first transform, the second transform, and a default transform to at least a portion of the plurality of feature vectors to generate first transformed feature vectors, second transformed feature vectors, and default feature vectors respectively;
  
  determine, based at least partly on a performance score, that the default feature vectors provide a better match to an acoustic model than the first transformed feature vectors and the second transformed feature vectors; and
  
  discard at least one of the first transform or the second transform and create a new transform, the new transform based at least partly on the default transform and speech recognition statistics regarding a speech recognition pass using the default transform, wherein the new transform is associated with the client device.
- Dependent claims (2, 3, 4)
- - 2. The system of claim 1, wherein the one or more processors are further programmed to:
    - determine, based at least partly on a performance score, that the first transformed feature vectors provide a better match to the acoustic model than the second transformed feature vectors and the default feature vectors; and
      
      modify the first transform based at least partly on speech recognition statistics regarding a speech recognition pass using the first transform.
  - 3. The system of claim 2, wherein the one or more processors are further programmed to determine whether to modify the first transform based at least partly on data regarding a reliability of speech recognition results from the speech recognition pass using the first transform.
  - 4. The system of claim 1, wherein the default transform comprises at least one of a general transform, a transform for males, a transform for females, or an identity transform.
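
The branch between claims 1 and 2 (default transform wins versus a device transform wins) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function `update_transform_set`, its callbacks, and the choice to discard the lowest-scoring device transform are all hypothetical. The claim only requires discarding at least one of the device transforms.

```python
def update_transform_set(device_transforms, scores, make_new_transform, refine, new_name):
    """Decide how to update one client device's transforms after scoring.

    device_transforms: dict mapping name -> transform for the device
    scores: dict mapping name -> performance score (higher is better);
            must also contain an entry for "default"
    make_new_transform: hypothetical callback that builds a transform from the
            default transform plus statistics of the default-transform pass
    refine: hypothetical callback that modifies the winning device transform
    """
    best = max(scores, key=scores.get)
    updated = dict(device_transforms)
    if best == "default":
        # Claim 1 branch. Assumption: drop the lowest-scoring device transform.
        worst = min(device_transforms, key=lambda name: scores[name])
        del updated[worst]
        updated[new_name] = make_new_transform()
    else:
        # Claim 2 branch: modify the winning device transform.
        updated[best] = refine(updated[best])
    return best, updated

transforms = {"first": "T1", "second": "T2"}

# Case 1: the default transform scores best.
best, updated = update_transform_set(
    transforms,
    {"first": -4.0, "second": -6.5, "default": -3.2},
    make_new_transform=lambda: "T_default_adapted",
    refine=lambda t: t + "_updated",
    new_name="third",
)

# Case 2: the first device transform scores best.
best2, updated2 = update_transform_set(
    transforms,
    {"first": -2.0, "second": -6.5, "default": -3.2},
    make_new_transform=lambda: "T_default_adapted",
    refine=lambda t: t + "_updated",
    new_name="third",
)
```

Strings stand in for real transforms so the control flow is easy to follow. In practice the callbacks would estimate transforms from speech recognition statistics.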

5. A computer-implemented method comprising:
- receiving, by a spoken language processing system comprising one or more computing devices, audio data regarding a user utterance, wherein the audio data is received from a client device separate from the one or more computing devices;
  
  determining, based at least partly on the audio data, to replace a session feature vector transform associated with the client device with a new feature vector transform, wherein the session feature vector transform is based on data generated during processing of prior audio data regarding one or more utterances, made prior to the user utterance, received from the client device within a threshold period of time;
  
  creating the new feature vector transform based at least partly on the audio data, wherein the new feature vector transform is associated with the client device; and
  
  performing speech recognition using the new feature vector transform.
- Dependent claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 16)
- - 6. The computer-implemented method of claim 5, further comprising receiving, from the client device, data regarding which feature vector transform of a plurality of feature vector transforms associated with the client device to use in performing speech recognition.
  - 7. The computer-implemented method of claim 6, wherein the data regarding which feature vector transform to use comprises speech processing alignments associated with at least a portion of audio data.
  - 8. The computer-implemented method of claim 5, further comprising receiving, from the client device, a plurality of feature vector transforms associated with the client device.
  - 9. The computer-implemented method of claim 5, further comprising performing speaker adaptation on at least a portion of the audio data with two or more of a plurality of feature vector transforms associated with the client device.
  - 10. The computer-implemented method of claim 9, further comprising performing speaker adaptation on at least a portion of the utterance audio data with a default transform.
  - 11. The computer-implemented method of claim 10, further comprising determining that the default transform provides better speech recognition results than the two or more of the plurality of feature vector transforms based at least partly on the speaker adaptation performed with the default transform and the two or more of the plurality of feature vector transforms.
  - 12. The computer-implemented method of claim 11, further comprising replacing a least-recently-used transform of the plurality of feature vector transforms with a modified version of the default transform.
  - 13. The computer-implemented method of claim 5, further comprising determining whether to store an updated version of the new feature vector transform is based at least partly on a user experience associated with the speech recognition.
  - 14. The computer-implemented method of claim 5, further comprising interpolating the new feature vector transform and a second transform associated with the client device according to interpolation weights, wherein the speech recognition is performed based at least partly on the interpolated new feature vector transform and second transform.
  - 16. The computer-implemented method of claim 5, further comprising:
    - performing first speech recognition on at least a portion of the audio data using the session feature vector transform;
      
      performing second speech recognition on at least the portion of the audio data using a default transform;
      
      determining that the default transform provides better speech recognition results than the session feature vector transform, wherein determining to replace the session feature vector transform with the new feature vector transform is based on determining that the default transform provides better speech recognition results than the session feature vector transform; and
      
      generating the new feature vector transform using the default transform.
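
Claim 12 replaces a least-recently-used transform with a modified version of the default transform. A minimal sketch of that bookkeeping, assuming a fixed number of transforms per client device, might look like the following. The class `DeviceTransformStore` and the transform names are hypothetical and do not appear in the claims.

```python
from collections import OrderedDict

class DeviceTransformStore:
    """Holds a bounded set of transforms for one client device (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._transforms = OrderedDict()  # least recently used entry comes first

    def use(self, name):
        """Return a transform and mark it as most recently used."""
        self._transforms.move_to_end(name)
        return self._transforms[name]

    def add(self, name, transform):
        """Store a transform, evicting the least recently used one if full."""
        if name not in self._transforms and len(self._transforms) >= self.capacity:
            self._transforms.popitem(last=False)
        self._transforms[name] = transform
        self._transforms.move_to_end(name)

    def names(self):
        return list(self._transforms)

store = DeviceTransformStore(capacity=2)
store.add("quiet_room", "T_quiet")
store.add("car", "T_car")
store.use("quiet_room")                       # "car" is now least recently used
store.add("default_adapted", "T_default_v2")  # evicts "car"
print(store.names())  # ['quiet_room', 'default_adapted']
```

An `OrderedDict` keeps the recency order implicit, so eviction is a single `popitem(last=False)` call.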

15. One or more non-transitory computer readable media comprising executable code that, when executed, cause one or more computing devices to perform a process comprising:
- receiving, by a spoken language processing system comprising one or more computing devices, audio data regarding a user utterance, wherein the audio data is received from a client device separate from the one or more computing devices;
  
  determining, based at least partly on the audio data, to replace a session feature vector transform associated with the client device with a new feature vector transform, wherein the session feature vector transform is based on data generated during processing of prior audio data regarding one or more utterances, made prior to the user utterance, received from the client device within a threshold period of time;
  
  creating the new feature vector transform based at least partly on the audio data, wherein the new feature vector transform is associated with the client device; and
  
  performing speech recognition using the new feature vector transform.
- Dependent claims (17, 18, 19, 20, 21, 22, 23, 24, 25)
- - 17. The one or more non-transitory computer readable media of claim 15, wherein the process further comprises updating the new feature vector transform based at least partly on speech recognition statistics.
  - 18. The one or more non-transitory computer readable media of claim 15, wherein the process further comprises receiving, from the client device, data regarding which feature vector transform of a plurality of feature vector transforms associated with the client device to use in performing speech recognition.
  - 19. The one or more non-transitory computer readable media of claim 15, wherein the process further comprises receiving, from the client device, the session feature vector transform.
  - 20. The one or more non-transitory computer readable media of claim 15, wherein the process further comprises performing parallel speaker adaptation on at least a portion of the audio data using the session feature vector transform associated with the client device and a default transform.
  - 21. The one or more non-transitory computer readable media of claim 20, wherein determining to replace the session feature vector transform with the new feature vector transform comprises:
    - determining that the default transform provides better speech recognition results than the session feature vector transform, wherein the new feature vector transform comprises a modified version of the default transform.
  - 22. The one or more non-transitory computer readable media of claim 15, wherein the process further comprises interpolating the new feature vector transform and a second transform associated with the client device according to interpolation weights, wherein the speech recognition is performed based at least partly on the interpolated new feature vector transform and second transform.
  - 23. The one or more non-transitory computer readable media of claim 22, wherein the new feature vector transform comprises a session transform based on processing of utterances in a current session, and wherein the second transform comprises a multi-session transform based on processing of utterances in multiple sessions.
  - 24. The one or more non-transitory computer readable media of claim 22, wherein the process further comprises updating the interpolation weights based at least partly on results obtained from performing speech processing on the utterance.
  - 25. The one or more non-transitory computer readable media of claim 15, wherein the process further comprises determining whether to store an updated version of the new feature vector transform is based at least partly on a user experience associated with the speech recognition.
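
Claims 14 and 22 to 24 describe interpolating two transforms according to interpolation weights, without stating how the interpolation is computed. One plausible reading, shown here purely as an assumption, is an element-wise weighted average of two affine transforms `(A, b)` with the weights normalized to sum to one. The function name `interpolate_transforms` is hypothetical.

```python
def interpolate_transforms(t1, t2, w1, w2):
    """Element-wise weighted average of two affine transforms (A, b).

    Assumption: weights are non-negative and are normalized to sum to 1.
    """
    total = w1 + w2
    if total <= 0:
        raise ValueError("interpolation weights must sum to a positive value")
    w1, w2 = w1 / total, w2 / total
    A1, b1 = t1
    A2, b2 = t2
    A = [[w1 * x + w2 * y for x, y in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    b = [w1 * x + w2 * y for x, y in zip(b1, b2)]
    return A, b

# Hypothetical session and multi-session transforms for one client device.
session = ([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
multi_session = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 4.0])

A, b = interpolate_transforms(session, multi_session, 0.75, 0.25)
print(b)  # [1.5, 1.0]
```

Updating the weights from recognition results, as claim 24 describes, would sit on top of this function and is not shown.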

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Salvador, Stan Weidner; Secker-Walker, Hugh Evan; Ramakrishnan, Karthik; Yang, Shengbin
Primary Examiner(s): Baker, Charlotte M
Application Number: US13/892,167
Time in Patent Office: 956 Days
Field of Search: 704/243, 704/246, 704/248, 704/256, 704/234, 704/231, 704/247, 704/251, 704/252, 704/244, 704/245, 704/256.2, 704/200
US Class Current: 1/1
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/30   Distributed recognition, e....
G10L 15/32   Multiple recognisers used i...
