Speech recognition with hierarchical networks

US 8,914,286 B1
Filed: 03/29/2012
Issued: 12/16/2014
Est. Priority Date: 04/14/2011
Status: Active Grant

2 Assignments

0 Petitions

Abstract

Provided are systems and methods for using hierarchical networks for recognition, such as speech recognition. Conventional automatic recognition systems may not be both efficient and flexible. Recognition systems are disclosed that may achieve efficiency and flexibility by employing hierarchical networks, prefix consolidation of networks, and future consolidation of networks. The disclosed networks may be associated with a network model and the associated network model may be modified during recognition to achieve greater flexibility.
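
The abstract names prefix consolidation but does not spell out its mechanics. The following is a minimal sketch that assumes it behaves like a conventional pronunciation prefix tree. Under that assumption, words whose pronunciations begin with the same speech units share those nodes, and each word does not carry its own copy. The `Node` class, the `consolidate_prefixes` and `count_nodes` functions, and the toy phone symbols are hypothetical illustrations. They are not the patent's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node of a prefix-consolidated speech unit network."""
    children: dict = field(default_factory=dict)  # speech unit -> child Node
    words: list = field(default_factory=list)     # words whose pronunciation ends here


def consolidate_prefixes(lexicon):
    """Merge pronunciations that share leading speech units into one tree."""
    root = Node()
    for word, units in lexicon.items():
        node = root
        for unit in units:
            # Reuse an existing child for a shared prefix; create one otherwise.
            node = node.children.setdefault(unit, Node())
        node.words.append(word)
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


lexicon = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "can": ["k", "ae", "n"],
}
root = consolidate_prefixes(lexicon)
print(count_nodes(root))  # prints 6 (root, k, ae, t, b, n); separate chains would need 10
```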

24 Claims

1. A computer-implemented method for performing speech recognition, comprising:
- as implemented by one or more computing devices configured to execute specific instructions, receiving audio input;
  
  recognizing a first word of an utterance using the audio input, a word network, and a speech unit network, wherein the word network is associated with a language model and the speech unit network is associated with a first speech unit model that comprises pronunciations of words using speech units; and
  
  subsequent to recognizing the first word of the utterance:
  
  determining information about an accent of a speaker of the audio input using the audio input;
  
  obtaining a second speech unit model using the information;
  
  associating the second speech unit model with the speech unit network;
  
  recognizing a second word of the same utterance using the second speech unit model; and
  
  generating speech recognition results comprising the first word, recognized using the first speech unit model, and the second word, recognized using the second speech unit model.
- Dependent claims (2, 3):
  - 2. The computer-implemented method of claim 1, wherein obtaining a second speech unit model using the information comprises selecting the second speech unit model from a plurality of speech unit models, wherein each of the plurality of speech unit models is associated with an accent.
  - 3. The computer-implemented method of claim 1, wherein the first speech unit model comprises pronunciations of words with an American English accent and the second speech unit model comprises pronunciations of words using a British English accent.
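
Claims 1 to 3 describe swapping the speech unit model that is attached to a speech unit network partway through an utterance. The sketch below illustrates only that control flow, and it rests on several simplifying assumptions. The audio input is already segmented into one list of speech units per word. The speech unit model is a plain pronunciation lexicon. The language model is reduced to a set of allowed words. Accent detection is a stub supplied by the caller. All names (`SPEECH_UNIT_MODELS`, `SpeechUnitNetwork`, `WordNetwork`, `recognize`, `detect_accent`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Pron = Tuple[str, ...]

# Hypothetical speech unit models: pronunciation lexicons keyed by accent (cf. claim 3).
SPEECH_UNIT_MODELS: Dict[str, Dict[Pron, str]] = {
    "american": {("ay",): "i", ("l", "ay", "k"): "like",
                 ("t", "ah", "m", "ey", "t", "ow"): "tomato"},
    "british": {("ay",): "i", ("l", "ay", "k"): "like",
                ("t", "ah", "m", "aa", "t", "ow"): "tomato"},
}


@dataclass
class SpeechUnitNetwork:
    """Maps a segment of speech units to a word through whichever model is attached."""
    model: Dict[Pron, str]

    def match(self, units: Sequence[str]) -> Optional[str]:
        return self.model.get(tuple(units))


@dataclass
class WordNetwork:
    """Stand-in for a word network whose language model is reduced to a vocabulary."""
    vocabulary: set

    def accept(self, word: Optional[str]) -> Optional[str]:
        return word if word in self.vocabulary else None


def recognize(audio: List[List[str]], word_net: WordNetwork, unit_net: SpeechUnitNetwork,
              detect_accent: Callable[[List[List[str]]], str]) -> List[Optional[str]]:
    results = []
    for i, segment in enumerate(audio):
        results.append(word_net.accept(unit_net.match(segment)))
        if i == 0:
            # After the first word, obtain a second speech unit model from the
            # detected accent and attach it to the same speech unit network.
            unit_net.model = SPEECH_UNIT_MODELS[detect_accent(audio)]
    return results


audio = [["ay"], ["l", "ay", "k"], ["t", "ah", "m", "aa", "t", "ow"]]
unit_net = SpeechUnitNetwork(SPEECH_UNIT_MODELS["american"])
word_net = WordNetwork({"i", "like", "tomato"})
print(recognize(audio, word_net, unit_net, detect_accent=lambda a: "british"))
# ['i', 'like', 'tomato']
```

Without the swap, the American lexicon would leave the third segment unrecognized (`None`). Its pronunciation of "tomato" uses `ey` where the input has `aa`.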

4. A computer-implemented method, comprising:
- as implemented by one or more computing devices configured to execute specific instructions, receiving input;
  
  recognizing a first token of an utterance using the input, a first network, and a second network, wherein the first network is associated with a first model and the second network is associated with a second model; and
  
  subsequent to recognizing the first token of the utterance:
  
  determining a characteristic using the input;
  
  obtaining a substitute model using the characteristic;
  
  associating the substitute model with the second network;
  
  recognizing a second token of the same utterance using the substitute model; and
  
  generating speech recognition results based at least partly on the first token, recognized using the second model, and the second token, recognized using the substitute model.
- Dependent claims (5, 6, 7, 8, 9, 10, 11, 12):
  - 5. The computer-implemented method of claim 4, wherein:
    - the input comprises audio input; and
      
      determining a characteristic using the input comprises determining a characteristic of a speaker of the audio input using the audio input.
  - 6. The computer-implemented method of claim 4, wherein the first network is a word network, and the word network comprises a plurality of word tokens indicating words that can be recognized from the input.
  - 7. The computer-implemented method of claim 4, wherein the first model is a grammar indicating sequences of words that can be recognized from the input.
  - 8. The computer-implemented method of claim 4, wherein recognizing a first token further comprises using a third network, wherein the third network is associated with a third model.
  - 9. The computer-implemented method of claim 4, wherein obtaining a substitute model using the characteristic comprises creating the substitute model by modifying the second model.
  - 10. The computer-implemented method of claim 9, wherein the second model is an acoustic model and wherein modifying the second model comprises selecting a third model using the characteristic and modifying the second model by interpolating between the second model and the third model.
  - 11. The computer-implemented method of claim 4, wherein the first token represents a phoneme.
  - 12. The computer-implemented method of claim 4, wherein obtaining a substitute model using the characteristic comprises selecting a speech unit model based on a speaker's accent.
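
Claims 9 and 10 create the substitute model by interpolating the second model with a third model that is selected using the characteristic. The claims do not fix an interpolation formula. This sketch assumes one simple choice: a per-parameter linear blend of Gaussian means and variances. For brevity, each speech unit is modeled by a single diagonal Gaussian. `DiagGaussian`, `THIRD_MODELS`, `interpolate`, and `substitute_model` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DiagGaussian:
    mean: List[float]
    var: List[float]


# Simplified acoustic model: one diagonal Gaussian per speech unit.
AcousticModel = Dict[str, DiagGaussian]

# Hypothetical third models, selected by a characteristic such as accent.
THIRD_MODELS: Dict[str, AcousticModel] = {
    "british": {"aa": DiagGaussian(mean=[3.0, 4.0], var=[2.0, 2.0])},
}


def interpolate(second: AcousticModel, third: AcousticModel, weight: float) -> AcousticModel:
    """Blend parameters as weight * second + (1 - weight) * third."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    blended: AcousticModel = {}
    for unit, g in second.items():
        h = third.get(unit, g)  # units absent from the third model are copied unchanged
        blended[unit] = DiagGaussian(
            mean=[weight * a + (1 - weight) * b for a, b in zip(g.mean, h.mean)],
            var=[weight * a + (1 - weight) * b for a, b in zip(g.var, h.var)],
        )
    return blended


def substitute_model(second: AcousticModel, characteristic: str,
                     weight: float = 0.5) -> AcousticModel:
    """Select a third model using the characteristic, then interpolate."""
    return interpolate(second, THIRD_MODELS[characteristic], weight)


second = {"aa": DiagGaussian(mean=[1.0, 2.0], var=[1.0, 1.0]),
          "t": DiagGaussian(mean=[0.0, 0.0], var=[1.0, 1.0])}
print(substitute_model(second, "british")["aa"])
# DiagGaussian(mean=[2.0, 3.0], var=[1.5, 1.5])
```

A weight of 1.0 returns the second model unchanged, and 0.0 replaces every shared unit with the third model's parameters.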

13. A computer readable, non-transitory storage medium having computer executable instructions for performing a method, comprising:
- receiving audio input;
  
  recognizing a first token of an utterance using the audio input, a first network, and a second network, wherein the first network is associated with a first model and the second network is associated with a second model; and
  
  subsequent to recognizing the first token:
  
  determining a characteristic of a speaker of the audio input using the audio input;
  
  obtaining a substitute model using the characteristic;
  
  associating the substitute model with the second network;
  
  recognizing a second token of the same utterance using the substitute model; and
  
  generating speech recognition results based at least partly on the first token, recognized using the second model, and the second token, recognized using the substitute model.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. The computer readable, non-transitory storage medium of claim 13, wherein the second network is a speech unit network, and the speech unit network comprises a plurality of speech unit tokens indicating speech units that can be recognized from the audio input.
  - 15. The computer readable, non-transitory storage medium of claim 14, wherein the second model comprises a lexicon indicating pronunciations of words using the speech units.
  - 16. The computer readable, non-transitory storage medium of claim 13, wherein recognizing a second token further comprises using a third network, wherein the third network is associated with a third model.
  - 17. The computer readable, non-transitory storage medium of claim 13, wherein obtaining a substitute model using the characteristic comprises creating a language model using language model interpolation.
  - 18. The computer readable, non-transitory storage medium of claim 13, wherein obtaining a substitute model using the characteristic comprises selecting a hidden Markov model based on a speaking rate of a speaker.
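
Claim 17 obtains the substitute model through language model interpolation. This minimal sketch assumes standard linear interpolation of two bigram models, P(w | h) = λ·P1(w | h) + (1 - λ)·P2(w | h). The characteristic picks the second component from a hypothetical table, `DOMAIN_LMS`. The probability values and all function names are illustrative only.

```python
from typing import Dict, Tuple

# (history word, next word) -> P(next word | history word)
Bigram = Dict[Tuple[str, str], float]

GENERAL_LM: Bigram = {("i", "like"): 0.2, ("i", "want"): 0.8}

# Hypothetical component models keyed by a speaker or domain characteristic.
DOMAIN_LMS: Dict[str, Bigram] = {
    "cooking": {("i", "like"): 0.6, ("i", "want"): 0.3, ("i", "bake"): 0.1},
}


def interpolate_lm(p1: Bigram, p2: Bigram, lam: float) -> Bigram:
    """Linear interpolation; an n-gram missing from one model counts as probability 0."""
    keys = set(p1) | set(p2)
    return {k: lam * p1.get(k, 0.0) + (1.0 - lam) * p2.get(k, 0.0) for k in keys}


def substitute_lm(characteristic: str, lam: float = 0.5) -> Bigram:
    """Build a substitute language model from the general model and a selected component."""
    return interpolate_lm(GENERAL_LM, DOMAIN_LMS[characteristic], lam)


lm = substitute_lm("cooking")
print(round(lm[("i", "like")], 3), round(lm[("i", "bake")], 3))  # 0.4 0.05
```

Suppose each component's probabilities sum to 1 for every history and 0 ≤ λ ≤ 1. Then the interpolated model's probabilities also sum to 1, so the result is still a valid language model.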

19. A system comprising a computing device configured to:
- receive input;
  
  recognize a first token of an utterance using the input, a first network, and a second network, wherein the first network is associated with a first model and the second network is associated with a second model; and
  
  subsequent to recognizing the first token of the utterance:
  
  determine a characteristic using the input;
  
  obtain a substitute model using the characteristic;
  
  associate the substitute model with the first network;
  
  recognize a second token of the same utterance using the substitute model; and
  
  generate speech recognition results based at least partly on the first token, recognized using the first model, and the second token, recognized using the substitute model.
- Dependent claims (20, 21, 22, 23, 24):
  - 20. The system of claim 19, wherein the first network is a speech unit in context network.
  - 21. The system of claim 19, wherein the second model is a pronunciation model.
  - 22. The system of claim 19, wherein the system is further configured to recognize a second token by using a third network, wherein the third network is associated with a third model.
  - 23. The system of claim 19, wherein the system is further configured to obtain a substitute model by selecting a model from a plurality of available models.
  - 24. The system of claim 19, wherein the system is further configured to obtain a substitute model using the characteristic by selecting a gaussian mixture model based on a signal-to-noise ratio.
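
Claims 18, 23, and 24 obtain the substitute model by selecting one of several prepared models according to a measured characteristic. This minimal sketch for claim 24 rests on two assumptions. First, the signal-to-noise ratio is estimated in decibels as 10·log10 of mean speech power over mean noise power. Second, each Gaussian mixture model is tagged with an SNR band. The band edges and model names in `SNR_EDGES_DB` and `GMM_NAMES` are hypothetical.

```python
import bisect
import math
from typing import Sequence


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Estimate SNR in dB from a speech segment and a noise-only segment."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)


# Hypothetical SNR bands: below 10 dB, 10 dB up to 20 dB, and 20 dB or above.
SNR_EDGES_DB = [10.0, 20.0]
GMM_NAMES = ["gmm_noisy", "gmm_moderate", "gmm_clean"]


def select_gmm(snr: float) -> str:
    """Return the name of the model whose SNR band contains the estimate."""
    return GMM_NAMES[bisect.bisect_right(SNR_EDGES_DB, snr)]


speech = [1.0, -1.0] * 50
noise = [0.01, -0.01] * 50
print(select_gmm(snr_db(speech, noise)))  # about 40 dB, prints gmm_clean
```

The same band lookup would apply to claim 18 if the measured value were a speaking rate and the candidates were hidden Markov models.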

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Canyon IP Holdings LLC (Intellectual Ventures LLC)
Inventors: Krishnamoorthy, Mahesh; Secker-Walker, Hugh; Basye, Kenneth J.
Primary Examiner(s): Godbold, Douglas
Application Number: US13/434,159
Time in Patent Office: 992 days
Field of Search: 704/231, 704/232, 704/235, 704/243, 704/244, 704/256-256.8
US Class Current: 704/244
CPC Class Codes

G10L 15/00   Speech recognition G10L17/0...

G10L 15/005   Language recognition

G10L 15/063   Training

G10L 15/083   Recognition networks G10L15...

G10L 15/142   Hidden Markov Models [HMMs]

G10L 15/187   Phonemic context, e.g. pron...

G10L 15/193   Formal grammars, e.g. finit...
