Duration ratio modeling for improved speech recognition

US 9,542,939 B1
Filed: 08/31/2012
Issued: 01/10/2017
Est. Priority Date: 08/31/2012
Status: Active Grant

Abstract

In speech recognition, the duration of a phoneme is taken into account when determining recognition scores. Specifically, the duration of a phoneme may be evaluated relative to the duration of neighboring phonemes. A phoneme that is interpreted to be significantly longer or shorter than its neighbors may be given a lower duration score. A duration score for a phoneme may be calculated and used to adjust a recognition score. In this manner a duration model may supplement an acoustic model and language model to improve speech recognition results.
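
The idea in the abstract can be sketched in a few lines of Python. This is only an illustration, not the patented scoring function: the Gaussian penalty on the log of the duration ratio, the `sigma` value, the choice of immediate neighbors, and the function name `neighbor_duration_score` are all assumptions made here. The sketch shows a phoneme scoring close to 1.0 when it matches its neighbors and scoring equally low when it is four times longer or four times shorter.

```python
import math

def neighbor_duration_score(durations, index, sigma=0.5):
    """Return a score in (0, 1] for the phoneme at `index`.

    Assumption: the score is a Gaussian penalty on the log ratio between the
    phoneme's duration and the mean duration of its immediate neighbors.
    All durations must be positive (seconds).
    """
    neighbors = [durations[j] for j in (index - 1, index + 1)
                 if 0 <= j < len(durations)]
    if not neighbors:
        return 1.0  # nothing to compare against
    reference = sum(neighbors) / len(neighbors)
    log_ratio = math.log(durations[index] / reference)
    # Symmetric in log space: 4x longer and 4x shorter are penalized equally.
    return math.exp(-0.5 * (log_ratio / sigma) ** 2)

# Hypothetical phoneme durations in seconds.
typical = neighbor_duration_score([0.09, 0.10, 0.11], 1)
stretched = neighbor_duration_score([0.09, 0.40, 0.11], 1)
clipped = neighbor_duration_score([0.09, 0.025, 0.11], 1)
```

Working in log space is one way to treat "too long" and "too short" symmetrically; a plain ratio would penalize them unevenly.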

27 Claims

1. A method of recognizing speech in audio data, the method comprising:
- receiving audio data representing speech, wherein the receiving is performed by an automated speech recognition (ASR) device configured to convert the audio data to text data, the ASR device comprising an ASR module;
  
  transforming, by the ASR module, the audio data into one or more feature vectors representing the speech;
  
  identifying, by the ASR module and using a portion of the one or more feature vectors, a sequence of phonemes represented in a portion of the audio data;
  
  determining, by the ASR module, a first duration of the sequence of phonemes;
  
  determining, by the ASR module, a second duration of a single phoneme within the sequence of phonemes;
  
  determining, by the ASR module, a duration score of the single phoneme, wherein the duration score is determined using the second duration in relation to the first duration;
  
  determining, by the ASR module, a recognition score based at least in part on the duration score;
  
  determining, by the ASR module, a speech recognition result based at least in part upon the recognition score, wherein the speech recognition result is the text data corresponding to the speech; and
  
  causing a command to be executed using the text data.
- Dependent claims (2, 3):
  - 2. The method of claim 1, wherein the sequence of phonemes comprises at least two phonemes in the sequence prior to the single phoneme and at least two phonemes in the sequence subsequent to the single phoneme.
  - 3. The method of claim 1, wherein determining the duration score comprises computing a ratio of the second duration to the first duration.
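
The ratio described in claims 1 to 3 can be illustrated with a short Python sketch. It assumes the phoneme durations are already available from an alignment; the function name `duration_ratio`, the `context` parameter, and the sample values are hypothetical. The window holds two phonemes before and two after the phoneme of interest (claim 2), the first duration is the total length of that window, and the result is the second duration divided by the first (claim 3).

```python
def duration_ratio(durations, index, context=2):
    """Ratio of one phoneme's duration to the duration of its surrounding sequence.

    `durations` is a list of phoneme durations in seconds. The sequence is the
    phoneme at `index` plus `context` phonemes on each side.
    """
    if index - context < 0 or index + context >= len(durations):
        raise ValueError("not enough context phonemes around index")
    window = durations[index - context : index + context + 1]
    first_duration = sum(window)         # duration of the whole sequence
    second_duration = durations[index]   # duration of the single phoneme
    return second_duration / first_duration

# Hypothetical alignment: five phonemes, the middle one is scored.
phoneme_durations = [0.08, 0.10, 0.12, 0.06, 0.04]
ratio = duration_ratio(phoneme_durations, index=2)  # 0.12 / 0.40, about 0.3
```

Because the ratio is relative to the surrounding sequence, it stays roughly the same when the whole passage is spoken faster or slower.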

4. A method, comprising:
- receiving audio data, wherein the receiving is performed by at least one automated speech recognition (ASR) device configured to convert the audio data into text data, the at least one ASR device comprising at least one ASR module;
  
  determining, by the at least one ASR module, a sequence of speech units represented in a portion of the audio data;
  
  determining, by the at least one ASR module, a first duration of the sequence of speech units;
  
  determining, by the at least one ASR module and using the first duration, an expected duration of a single speech unit in the sequence of speech units;
  
  determining, by the at least one ASR module, a second duration of the single speech unit;
  
  determining, by the at least one ASR module, a duration score of the single speech unit, the duration score corresponding to the second duration in relation to the expected duration;
  
  determining, by the at least one ASR module, a speech recognition result based at least in part on the duration score of the single speech unit, wherein the speech recognition result is the text data corresponding to the received audio data representing speech; and
  
  causing a command to be executed using the text data.
- Dependent claims (5, 6, 7, 8, 9, 10, 11):
  - 5. The method of claim 4, wherein determining the duration score comprises calculating the duration score using at least one of a chi-squared model or a mixture model.
  - 6. The method of claim 4, wherein the duration score is further based at least in part on at least one of an absolute duration of the single speech unit, a duration ratio of a neighboring speech unit, or the second duration compared to a third duration of an utterance.
  - 7. The method of claim 4, wherein the single speech unit is one of a phoneme, a triphone, or a quinphone.
  - 8. The method of claim 4, wherein determining the first duration comprises excluding at least one speech unit in the sequence of speech units when determining the first duration.
  - 9. The method of claim 4, wherein the duration score is based at least in part on a ratio of the second duration to the first duration.
  - 10. The method of claim 4, further comprising normalizing the duration score using a number of speech units in an utterance comprising the single speech unit.
  - 11. The method of claim 10, further comprising:
    - determining a respective duration score for each speech unit in an utterance including the single speech unit; and
      
      wherein the normalizing the duration score comprises multiplying the determined respective duration scores of each speech unit in the utterance together and taking an Nth root of a result of the multiplying, where N is a number of speech units in the utterance.
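
The normalization in claim 11 is a geometric mean: multiply the N per-unit duration scores together and take the Nth root. A minimal Python sketch follows, assuming each score is a non-negative number; the function name `normalize_duration_scores` and the choice to compute in log space (to avoid underflow on long utterances) are assumptions of this sketch, not details from the claims.

```python
import math

def normalize_duration_scores(scores):
    """Geometric mean of per-unit duration scores: (s1 * s2 * ... * sN) ** (1/N)."""
    if not scores:
        raise ValueError("at least one duration score is required")
    if any(s < 0 for s in scores):
        raise ValueError("duration scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0  # a zero factor makes the whole product zero
    # Sum of logs divided by N equals the log of the Nth root of the product.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

short_utterance = normalize_duration_scores([0.5, 0.5])
long_utterance = normalize_duration_scores([0.5] * 10)
raw_product_long = 0.5 ** 10  # unnormalized product shrinks with length
```

Both utterances normalize to about 0.5, while the raw product for the longer one is below 0.001. This is why normalizing by the number of speech units keeps long utterances from being penalized simply for having more units.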

12. A computing device configured to convert audio data to text data, comprising:
- an audio capture device configured to receive audio data, the received audio data representing spoken utterances;
  
  an automatic speech recognition (ASR) module configured to transform the received audio data into a sequence of speech units represented in the audio data;
  
  at least one processor;
  
  a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, configuring the at least one processor:
  
  to determine a first duration of the sequence of speech units of the received audio data;
  
  to determine, using the first duration, an expected duration of a single speech unit of the received audio data in the sequence of speech units;
  
  to determine, by the ASR module, a second duration of the single speech unit;
  
  to determine a duration score of the single speech unit, the duration score corresponding to the second duration in relation to the expected duration;
  
  to determine a speech recognition result based at least in part on the duration score of the single speech unit, wherein the speech recognition result is the text data corresponding to the received audio data representing speech; and
  
  to cause a command to be executed using the text data.
- Dependent claims (13, 14, 15, 16, 17, 18, 19):
  - 13. The computing device of claim 12, wherein the at least one processor is further configured to compute the duration score using at least one of a chi-squared model or a mixture model.
  - 14. The computing device of claim 12, wherein the duration score is further based at least in part on at least one of an absolute duration of the single speech unit, a duration ratio of a neighboring speech unit, or the second duration compared to a third duration of an utterance.
  - 15. The computing device of claim 12, wherein the single speech unit is one of a phoneme, a triphone, or a quinphone.
  - 16. The computing device of claim 12, wherein the at least one processor is further configured to exclude at least one speech unit in the sequence of speech units when determining the first duration.
  - 17. The computing device of claim 12, wherein the duration score is based at least in part on a ratio of the second duration to the first duration.
  - 18. The computing device of claim 12, wherein the at least one processor is further configured to normalize the duration score using a number of speech units in an utterance comprising the single speech unit.
  - 19. The computing device of claim 18, wherein the at least one processor is further configured to determine a respective duration score for each speech unit in an utterance including the first speech unit; and
    - wherein the at least one processor configured to normalize the duration score comprises the at least one processor configured to multiply the determined respective duration scores of each speech unit in the utterance together and take an Nth root of a result of the multiplying, where N is a number of speech units in the utterance.
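
Claims 12 and 16 describe deriving an expected duration from the duration of the sequence, optionally excluding some speech units. The claims do not say how the expected duration is computed or which units are excluded, so the following Python sketch makes two labeled assumptions: the expected duration is an even share of the sequence duration, and the excluded unit is a silence marker with the hypothetical label `"sil"`. The function names are also illustrative.

```python
def expected_duration(units, excluded=frozenset({"sil"})):
    """Expected duration of one speech unit, derived from the sequence duration.

    `units` is a list of (label, duration_in_seconds) pairs.
    Assumption: the expected duration is the sequence duration divided evenly
    among the units that were not excluded.
    """
    kept = [duration for label, duration in units if label not in excluded]
    if not kept:
        raise ValueError("all speech units were excluded")
    first_duration = sum(kept)  # duration of the sequence after exclusions
    return first_duration / len(kept)

def relative_duration(units, index, excluded=frozenset({"sil"})):
    """Second duration (the unit at `index`) in relation to the expected duration."""
    _, second_duration = units[index]
    return second_duration / expected_duration(units, excluded)

# Hypothetical alignment with a long leading silence.
units = [("sil", 0.50), ("HH", 0.06), ("AH", 0.09), ("L", 0.07), ("OW", 0.18)]
with_exclusion = relative_duration(units, 4)                           # about 1.8
without_exclusion = relative_duration(units, 4, excluded=frozenset())  # about 1.0
```

With the silence excluded, the final vowel stands out as noticeably longer than expected; with the silence included, the same vowel looks unremarkable. That difference is the practical motivation for allowing exclusions.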

20. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
- program code to cause an automated speech recognition (ASR) module, configured to convert audio data to text data, of the computing device to transform the audio data received by the computing device into a sequence of speech units represented in the audio data;
  
  program code to determine a first duration of the sequence of speech units of the received audio data;
  
  program code to determine, using the first duration, an expected duration of a single speech unit of the received audio data in the sequence of speech units;
  
  program code to determine, by the ASR module, a second duration of the single speech unit;
  
  program code to determine a duration score of the single speech unit, the duration score corresponding to the second duration in relation to the expected duration; and
  
  program code to determine a speech recognition result based at least in part on the duration score of the single speech unit, wherein the speech recognition result is the text data corresponding to the received audio data representing speech; and
  
  program code to cause a command to be executed using the text data.
- Dependent claims (21, 22, 23, 24, 25, 26, 27):
  - 21. The non-transitory computer-readable storage medium of claim 20, wherein the program code to determine the duration score comprises program code to compute the duration score using at least one of a chi-squared model or a mixture model.
  - 22. The non-transitory computer-readable storage medium of claim 20, wherein the duration score is further based at least in part on at least one of an absolute duration of the single speech unit, a duration ratio of a neighboring speech unit, or the second duration compared to a third duration of an utterance.
  - 23. The non-transitory computer-readable storage medium of claim 20, wherein the single speech unit is one of a phoneme, a triphone, or a quinphone.
  - 24. The non-transitory computer-readable storage medium of claim 20, in which the program code to determine the first duration comprises program code to exclude at least one speech unit in the sequence of speech units when determining the first duration.
  - 25. The non-transitory computer-readable storage medium of claim 20, wherein the duration score is based at least in part on a ratio of the second duration to the first duration.
  - 26. The non-transitory computer-readable storage medium of claim 20, further comprising program code to normalize the duration score using a number of speech units in an utterance comprising the single speech unit.
  - 27. The non-transitory computer-readable storage medium of claim 26, further comprising:
    - program code to determine a respective duration score for each speech unit in an utterance including the first speech unit; and
      
      wherein the program code to normalize the duration score comprises program code to multiply the determined respective duration scores of each speech unit in the utterance together and take an Nth root of a result of the multiplying, where N is a number of speech units in the utterance.
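
The final steps of the independent claims (a recognition result based on the duration score, then a command executed from the text) can be sketched as a rescoring pass over candidate hypotheses. Everything concrete below is an assumption for illustration: the log-linear combination of acoustic, language, and duration scores, the `duration_weight` parameter, the sample hypotheses and their numbers, and the `commands` lookup table.

```python
import math

def recognition_score(acoustic_logp, language_logp, duration_score, duration_weight=1.0):
    """Assumed combination: add the weighted log duration score to the model log scores."""
    return acoustic_logp + language_logp + duration_weight * math.log(duration_score)

def best_hypothesis(hypotheses, duration_weight=1.0):
    """Pick the hypothesis with the highest combined recognition score."""
    return max(
        hypotheses,
        key=lambda h: recognition_score(
            h["acoustic"], h["language"], h["duration"], duration_weight
        ),
    )

# Hypothetical candidates: the second is slightly better acoustically but
# implies an implausible phoneme duration, so its duration score is low.
hypotheses = [
    {"text": "play music", "acoustic": -12.0, "language": -3.0, "duration": 0.80},
    {"text": "pay music", "acoustic": -11.5, "language": -3.2, "duration": 0.05},
]

# Hypothetical mapping from recognized text to a command.
commands = {"play music": lambda: "starting playback"}

result = best_hypothesis(hypotheses)
handler = commands.get(result["text"])
outcome = handler() if handler else None
```

Setting `duration_weight=0.0` reproduces a recognizer with no duration model, which in this toy example selects "pay music" instead. The sketch therefore shows the duration score changing the recognition result, in line with the abstract's description of a duration model supplementing the acoustic and language models.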

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Hoffmeister, Bjorn
Primary Examiner(s): JACKSON, JAKIEDA R
Application Number: US13/600,851
Time in Patent Office: 1,593 Days
Field of Search
US Class Current: 1/1
CPC Class Codes:
G10L 15/02   Feature extraction for spee...
G10L 15/148   Duration modelling in HMMs,...
G10L 2015/025   Phonemes, fenemes or fenone...
