Multi-microphone speech recognition systems and related techniques

US 10,013,981 B2
Filed: 06/06/2015
Issued: 07/03/2018
Est. Priority Date: 06/06/2015
Status: Active Grant

1 Assignment

0 Petitions

Abstract

A speech recognition system for resolving impaired utterances can have a speech recognition engine configured to receive a plurality of representations of an utterance and concurrently to determine a plurality of highest-likelihood transcription candidates corresponding to each respective representation of the utterance. The recognition system can also have a selector configured to determine a most-likely accurate transcription from among the transcription candidates. As but one example, the plurality of representations of the utterance can be acquired by a microphone array, and beamforming techniques can generate independent streams of the utterance across various look directions using output from the microphone array.
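The flow the abstract describes (several representations in, one candidate list per representation, one selected transcription out) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `recognize_all`, `select_most_likely`, and `toy_recognizer` are hypothetical names, the recognizer is a stand-in that scores by signal energy, and a thread pool is only one assumed way to process the representations concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (transcription, likelihood)
Recognizer = Callable[[Sequence[float]], List[Candidate]]


def recognize_all(representations: Sequence[Sequence[float]],
                  recognizer: Recognizer) -> List[List[Candidate]]:
    """Run the recognizer on every representation concurrently.

    Results come back in the same order as the input representations.
    """
    with ThreadPoolExecutor() as pool:
        return list(pool.map(recognizer, representations))


def select_most_likely(candidate_lists: Sequence[Sequence[Candidate]]) -> str:
    """Pick the single candidate with the largest likelihood overall."""
    pooled = [c for candidates in candidate_lists for c in candidates]
    return max(pooled, key=lambda c: c[1])[0]


def toy_recognizer(samples: Sequence[float]) -> List[Candidate]:
    """Hypothetical stand-in for a real recognizer (assumption, not from the patent).

    It treats a higher-energy representation as a cleaner capture and
    reports a higher likelihood for it.
    """
    energy = sum(s * s for s in samples) / len(samples)
    likelihood = energy / (1.0 + energy)
    guess = "turn on the lights" if energy > 0.5 else "turn on the flights"
    return [(guess, likelihood)]


# Two representations of the same utterance: one weak, one strong.
representations = [[0.1, -0.1, 0.2], [1.0, -1.0, 0.9]]
candidate_lists = recognize_all(representations, toy_recognizer)
print(select_most_likely(candidate_lists))  # prints "turn on the lights"
```

Taking the overall maximum is only one possible selection rule; the claims list several alternatives.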

59 Citations

20 Claims

1. A speech recognition system for resolving impaired utterances, comprising:
- a speech recognition engine configured to receive a plurality of representations of an utterance corresponding to output from one or more microphone transducers in a plurality of microphone transducers exposed to the utterance, and further configured to determine, concurrently, a plurality of highest-likelihood transcription candidates, each corresponding to a respective representation of the utterance;
  
  a selector configured to determine a most-likely accurate transcription of the utterance from among the plurality of highest-likelihood transcription candidates; and
  
  an output device configured to output recognized speech corresponding to the most-likely accurate transcription of the utterance.
  - 2. The speech recognition system according to claim 1, wherein the speech recognition engine comprises a plurality of constituent recognizers, each constituent recognizer being configured to receive a respective one of the plurality of representations of the utterance and to determine one or more corresponding transcription candidates corresponding to the received representation of the utterance, and, for each transcription candidate, to determine an associated likelihood of being an accurate transcription of the utterance.
  - 3. The speech recognition system according to claim 2, wherein each constituent recognizer generates intermediate transcription information corresponding to each respective transcription candidate determined by the respective constituent recognizer, wherein the recognition system further comprises a parameter updater configured to compare intermediate transcription information from among the constituent recognizers and to revise at least one intermediate transcription information responsive to a measure of quality of the at least one intermediate transcription information relative to a selected threshold measure of quality.
  - 4. The speech recognition system according to claim 1, wherein each of the highest-likelihood transcription candidates has a corresponding likelihood of being an accurate transcription of the utterance, and wherein the selector is configured to determine the most-likely accurate transcription based on a comparison of the corresponding likelihoods.
  - 5. The speech recognition system according to claim 4, wherein the selector is further configured to select a transcription candidate having a largest likelihood among the plurality of transcription candidates, a transcription candidate having a largest net likelihood among the plurality of transcription candidates, a transcription candidate having a largest frequency of being a highest-likelihood transcription candidate, or a transcription candidate having a highest cumulative rank order, wherein the rank order for each transcription candidate corresponding to a given representation corresponds to a relative likelihood of the respective transcription candidate compared to a likelihood of each other transcription candidate corresponding to the given utterance.
  - 6. The speech recognition system according to claim 1, wherein the speech recognition engine comprises a plurality of constituent recognizers, each constituent recognizer being configured to produce a corresponding ordered list of highest-likelihood transcription candidates and a likelihood measure for each respective transcription candidate.
  - 7. The speech recognition system according to claim 1, wherein the speech recognition engine comprises a plurality of constituent recognizers, each having a plurality of recognition stages, wherein each recognition stage is configured to extract a plurality of component candidates and intermediate transcription information corresponding to each component candidate.
  - 8. The speech recognition system according to claim 7, further comprising an updater configured to revise one or more selected component candidates based on a comparison of the intermediate transcription information corresponding to the one or more selected component candidates in relation to the intermediate transcription information corresponding to the other component candidates.

9. A speech recognition method comprising:
- receiving a plurality of representations of an utterance, wherein each representation corresponds to output from one or more microphone transducers among a plurality of microphone transducers exposed to the utterance;
  
  concurrently determining a plurality of highest-likelihood transcription candidates corresponding to each representation of the utterance;
  
  selecting a most-likely accurate transcription from among the transcription candidates corresponding to the plurality of representations of the utterance; and
  
  outputting recognized speech corresponding to the most-likely accurate transcription of the utterance.
  - 10. The speech recognition method according to claim 9, wherein the act of concurrently determining one or more highest-likelihood transcription candidates corresponding to each of the plurality of representations of the utterance comprises concurrently processing each representation of the utterance with a respective different constituent recognizer to determine one or more corresponding highest-likelihood transcription candidates and an associated likelihood of being an accurate transcription for each transcription candidate.
  - 11. The speech recognition method according to claim 9, wherein the act of concurrently determining a plurality of highest-likelihood transcription candidates comprises producing with each of a plurality of constituent recognizers an ordered list of highest-likelihood transcription candidates and a corresponding likelihood measure for each respective transcription candidate.
  - 12. The speech recognition method according to claim 11, wherein each constituent recognizer has a plurality of recognition stages, wherein each recognition stage is configured to extract a plurality of component candidates and intermediate transcription information corresponding to each component candidate, the method further comprising revising one or more selected component candidates based on a comparison of the intermediate transcription information corresponding to the one or more selected component candidates in relation to intermediate transcription information corresponding to the other component candidates.
  - 13. The speech recognition method according to claim 9, further comprising generating the plurality of representations of the utterance from an output signal from each of a plurality of microphone transducers, wherein the plurality of representations comprises a selected one or more of a recording from each in the plurality of microphone transducers, a linear combination of the output signals from the plurality of microphone transducers, a plurality of linear combinations of the output signals from the plurality of microphone transducers, a filtered version of the output signal from at least one of the microphone transducers, and a combination thereof.
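The kinds of representations recited in claim 13 (raw recordings, one or more linear combinations of the microphone outputs, and a filtered version of one output) might be generated along the following lines. This is a sketch under stated assumptions: `linear_combination`, `moving_average`, and `build_representations` are hypothetical names, the weight sets are arbitrary illustrative values rather than real beamformer coefficients, and the moving-average filter is just one simple choice of filter.

```python
from typing import List, Sequence

Signal = List[float]


def linear_combination(signals: Sequence[Signal], weights: Sequence[float]) -> Signal:
    """Weighted sample-by-sample sum of equal-length microphone signals."""
    if len(signals) != len(weights):
        raise ValueError("need exactly one weight per microphone signal")
    length = len(signals[0])
    return [sum(w * s[i] for w, s in zip(weights, signals)) for i in range(length)]


def moving_average(signal: Signal, width: int) -> Signal:
    """Causal moving-average filter; early samples average what is available."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out


def build_representations(signals: Sequence[Signal],
                          weight_sets: Sequence[Sequence[float]],
                          filter_width: int = 2) -> List[Signal]:
    """Collect raw recordings, linear combinations, and one filtered signal."""
    representations = [list(s) for s in signals]                 # raw recordings
    representations += [linear_combination(signals, w) for w in weight_sets]
    representations.append(moving_average(signals[0], filter_width))
    return representations


# Two microphones, three samples each (illustrative values only).
mics = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
weight_sets = [[0.5, 0.5], [1.0, -1.0]]  # assumed weights: an average and a difference
for rep in build_representations(mics, weight_sets):
    print(rep)
```

In a beamforming arrangement like the one the abstract mentions, each weight set would correspond to a look direction; practical beamformers typically also apply per-microphone delays or filters, which this sketch omits.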

14. A speech recognition method comprising:
- concurrently determining a plurality of highest-likelihood transcription candidates corresponding to each of a plurality of representations of an utterance; and
  
  selecting a most-likely accurate transcription from among the transcription candidates corresponding to the plurality of representations of the utterance, wherein each of the highest-likelihood transcription candidates has an associated likelihood of being an accurate transcription of the utterance, and the act of selecting the most-likely accurate transcription comprises selecting from among the transcription candidates a transcription candidate having a largest likelihood among the plurality of transcription candidates, a transcription candidate having a largest net likelihood among the plurality of transcription candidates, a transcription candidate having a largest frequency of being a highest likelihood transcription candidate from each representation of the utterance, or a transcription candidate having a highest cumulative rank order among the representations of the utterance, wherein a rank order for each transcription candidate corresponding to a given representation corresponds to the relative likelihood of the respective transcription candidate compared to a likelihood of each of the other transcription candidates corresponding to the given utterance.
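The four selection alternatives recited in claim 14 can be illustrated with a short sketch. The claim does not define the terms precisely, so the readings here are assumptions: "net likelihood" is taken as the sum of a candidate's likelihoods across representations, "frequency" as the number of representations in which a candidate ranks first, and "cumulative rank order" as a Borda-style point total. All function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

Candidate = Tuple[str, float]  # (transcription, likelihood)
CandidateLists = Sequence[Sequence[Candidate]]  # one list per representation


def by_largest_likelihood(lists: CandidateLists) -> str:
    """Candidate with the single largest likelihood in any list."""
    return max((c for cands in lists for c in cands), key=lambda c: c[1])[0]


def by_net_likelihood(lists: CandidateLists) -> str:
    """Candidate with the largest likelihood summed over all lists (assumed reading)."""
    totals: Dict[str, float] = defaultdict(float)
    for cands in lists:
        for text, likelihood in cands:
            totals[text] += likelihood
    return max(totals, key=totals.get)


def by_top_frequency(lists: CandidateLists) -> str:
    """Candidate that is most often the top candidate of a list."""
    tops = Counter(max(cands, key=lambda c: c[1])[0] for cands in lists if cands)
    return tops.most_common(1)[0][0]


def by_cumulative_rank(lists: CandidateLists) -> str:
    """Candidate with the most rank points; first place in a list of n earns n."""
    points: Dict[str, int] = defaultdict(int)
    for cands in lists:
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)
        for position, (text, _) in enumerate(ranked):
            points[text] += len(ranked) - position
    return max(points, key=points.get)


# Illustrative candidate lists from three representations of one utterance.
example: List[List[Candidate]] = [
    [("play jazz", 0.90), ("play chess", 0.05)],
    [("play chess", 0.50), ("play jazz", 0.45)],
    [("play chess", 0.55), ("play jazz", 0.40)],
]
print(by_largest_likelihood(example))  # prints "play jazz"
print(by_net_likelihood(example))      # prints "play jazz"
print(by_top_frequency(example))       # prints "play chess"
print(by_cumulative_rank(example))     # prints "play chess"
```

The example data is chosen so the rules disagree: one very confident representation dominates the likelihood-based rules, while the two vote-style rules follow the majority of representations.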

15. A non-transitory, computer-readable media containing instructions that, when executed by a processor, cause a computing environment to perform a speech recognition method comprising:
- receiving a plurality of representations of an utterance, wherein each representation corresponds to output from one or more microphone transducers among a plurality of microphone transducers exposed to the utterance;
  
  concurrently determining a plurality of highest-likelihood transcription candidates corresponding to each representation of the utterance;
  
  selecting a most-likely accurate transcription from among the transcription candidates corresponding to the plurality of representations of the utterance; and
  
  outputting recognized speech corresponding to the most-likely accurate transcription of the utterance.
  - 16. The non-transitory, computer-readable media according to claim 15, wherein the act of concurrently determining one or more highest-likelihood transcription candidates corresponding to each of the plurality of representations of the utterance comprises concurrently processing each representation of the utterance with a respective different constituent recognizer to determine one or more corresponding highest-likelihood transcription candidates and an associated likelihood of being an accurate transcription for each transcription candidate.
  - 17. The non-transitory, computer-readable media according to claim 16, wherein each of the highest-likelihood transcription candidates has an associated likelihood of being an accurate transcription of the utterance, and the act of selecting the most-likely accurate transcription comprises selecting from among the transcription candidates a transcription candidate having a largest likelihood among the plurality of transcription candidates, a transcription candidate having a largest net likelihood among the plurality of transcription candidates, a transcription candidate having a largest frequency of being a highest likelihood transcription candidate from each representation of the utterance, or a transcription candidate having a highest cumulative rank order among the representations of the utterance, wherein a rank order for each transcription candidate corresponding to a given representation corresponds to the relative likelihood of the respective transcription candidate compared to a likelihood of each of the other transcription candidates corresponding to the given utterance.
  - 18. The non-transitory, computer-readable media according to claim 15, wherein the act of concurrently determining a plurality of highest-likelihood transcription candidates comprises producing with each of a plurality of constituent recognizers an ordered list of highest-likelihood transcription candidates and a corresponding likelihood measure for each respective transcription candidate.
  - 19. The non-transitory, computer-readable media according to claim 18, wherein each constituent recognizer has a plurality of recognition stages, wherein each recognition stage is configured to extract a plurality of component candidates and intermediate transcription information corresponding to each component candidate, the method further comprising revising one or more selected component candidates based on a comparison of the intermediate transcription information corresponding to the one or more selected component candidates in relation to intermediate transcription information corresponding to the other component candidates.
  - 20. The non-transitory, computer-readable media according to claim 15, further comprising generating the plurality of representations of the utterance from an output signal from each of a plurality of microphone transducers, wherein the plurality of representations comprises a selected one or more of a recording from each in the plurality of microphone transducers, a linear combination of the output signals from the plurality of microphone transducers, a plurality of linear combinations of the output signals from the plurality of microphone transducers, and a combination thereof.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Ramprashad, Sean A.; Thornburg, Harvey D.; Krishnaswamy, Arvindh; Lindahl, Aram M.
Primary Examiner(s): SINGH, SATWANT K
Application Number: US14/732,711
Publication Number: US 20160358606A1
Time in Patent Office: 1,123 Days
Field of Search: None
CPC Class Codes:
- G10L 15/20 Speech recognition techniqu...
- G10L 15/32 Multiple recognisers used i...
