Automatic speech recognition based on user feedback

US 10,446,141 B2
Filed: 01/07/2015
Issued: 10/15/2019
Est. Priority Date: 08/28/2014
Status: Active Grant

Abstract

Systems and processes for processing speech in a digital assistant are provided. In one example process, a first speech input can be received from a user. The first speech input can be processed using a first automatic speech recognition system to produce a first recognition result. An input indicative of a potential error in the first recognition result can be received. The input can be used to improve the first recognition result. For example, the input can include a second speech input that is a repetition of the first speech input. The second speech input can be processed using a second automatic speech recognition system to produce a second recognition result.
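The flow described in the abstract can be sketched as a small control loop. This is a minimal illustration, not the patented implementation: the names `handle_speech`, `fast_asr`, `accurate_asr`, `user_rejects`, `reprompt`, and `combine` are hypothetical, and the recognizers are passed in as plain callables so the sketch stays independent of any real speech library. The `combine` step corresponds to the combining step recited in the claims.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: a recognizer maps raw audio bytes to a text result.
Recognizer = Callable[[bytes], str]


@dataclass
class Outcome:
    text: str    # recognition result used to determine the user intent
    source: str  # "first" or "combined"


def handle_speech(
    first_audio: bytes,
    fast_asr: Recognizer,
    accurate_asr: Recognizer,
    user_rejects: Callable[[str], bool],
    reprompt: Callable[[], bytes],
    combine: Callable[[str, str], str],
) -> Outcome:
    """Run the first recognizer; on rejection, reprompt and use the second."""
    first_result = fast_asr(first_audio)
    # Assumption: user_rejects stands in for "perform the task, then observe
    # whether the user signals a rejection".
    if not user_rejects(first_result):
        return Outcome(first_result, "first")
    # Ask the user to repeat, then process the repetition with a second
    # recognizer that uses different models.
    second_audio = reprompt()
    second_result = accurate_asr(second_audio)
    return Outcome(combine(first_result, second_result), "combined")
```

Passing the recognizers as parameters keeps the choice of models open; the only structural requirement in this sketch is that the second recognizer runs only after a rejection.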

50 Claims

1. A method for processing speech in a digital assistant, the method comprising:
- at an electronic device with a processor and memory storing one or more programs for execution by the processor:
  
  receiving, from a network interface, a first speech input;
  
  processing the first speech input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  upon performing the first task, receiving, from the network interface, an input representing a rejection of the first task;
  
  in response to receiving the input, providing a prompt seeking a repetition of at least a portion of the first speech input;
  
  receiving, from the network interface, a second speech input;
  
  in accordance with the received input representing a rejection of the first task, processing the second speech input using a second automatic speech recognition system to produce a second speech recognition result, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system;
  
  determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined from the combined speech recognition result.
- - 2. The method of claim 1, wherein the input is a speech input that includes a predetermined utterance.
  - 3. The method of claim 1, wherein the input comprises a selection of an affordance.
  - 4. The method of claim 1, wherein at least a portion of text of the first speech recognition result is displayed on the electronic device, and wherein the input comprises a selection of at least a portion of the displayed text.
  - 5. The method of claim 1, further comprising:
    - in accordance with receiving the input, identifying a portion of the first speech input corresponding to a potential error in the first speech recognition result.
  - 6. The method of claim 5, wherein processing the first speech input using the first automatic speech recognition system includes determining a confidence measure of each word in a text of the first speech recognition result, and wherein the portion of the first speech input associated with the potential error is identified based on the confidence measure of each word in the text.
  - 7. The method of claim 5, wherein the prompt includes a request to repeat the identified portion of the first speech input corresponding to the potential error.
  - 8. The method of claim 1, wherein the combined result is determined by performing automatic speech recognition system combination using the first speech recognition result and the second speech recognition result.
  - 9. The method of claim 1, wherein the second automatic speech recognition system is associated with a greater computation cost than the first automatic speech recognition system in order to achieve greater accuracy.
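Claims 5 to 7 describe locating the likely error from per-word confidence measures and asking the user to repeat only that portion. A minimal sketch of one way to do this follows. The function names, the `(word, confidence)` representation, and the 0.5 threshold are assumptions made for illustration; the claims do not specify them.

```python
from typing import List, Optional, Tuple

Word = Tuple[str, float]  # (recognized word, confidence in [0, 1])


def find_suspect_span(
    words: List[Word], threshold: float = 0.5
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest run of low-confidence words.

    `end` is exclusive. Returns None when every word meets the threshold.
    """
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    # A sentinel with infinite confidence closes a run that reaches the end.
    for i, (_, conf) in enumerate(words + [("", float("inf"))]):
        if conf < threshold:
            if start is None:
                start = i
        elif start is not None:
            if best is None or (i - start) > (best[1] - best[0]):
                best = (start, i)
            start = None
    return best


def build_prompt(words: List[Word], threshold: float = 0.5) -> str:
    """Build a prompt asking the user to repeat the suspect portion."""
    span = find_suspect_span(words, threshold)
    if span is None:
        return "Please repeat your request."
    phrase = " ".join(w for w, _ in words[span[0]:span[1]])
    return f'Please repeat the part I heard as "{phrase}".'
```

Choosing the longest low-confidence run is only one possible heuristic; picking the run with the lowest average confidence would be an equally reasonable variant.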

10. A method for processing speech in a digital assistant, the method comprising:
- at an electronic device with a processor and memory storing one or more programs for execution by the processor:
  
  receiving an input containing user speech;
  
  processing the input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  upon performing the first task, receiving a second input representing a rejection of the first task;
  
  in response to receiving the second input, processing at least a portion of the audio signal using a second automatic speech recognition system to produce a second speech recognition result, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system;
  
  determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined from the combined speech recognition result.
- - 11. The method of claim 10, wherein an error rate of the second automatic speech recognition system is lower than an error rate of the first automatic speech recognition system.
  - 12. The method of claim 10, wherein a latency of the second automatic speech recognition system is greater than a latency of the first automatic speech recognition system.
  - 13. The method of claim 10, wherein the combined result is determined by performing automatic speech recognition system combination using the first speech recognition result and the second speech recognition result.
  - 14. The method of claim 13, wherein performing automatic speech recognition system combination comprises implementing at least one of recognition output voting error reduction, cross-adaptation, confusion network combination, and lattice combination.
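Claim 14 names several system-combination techniques, including recognition output voting error reduction (ROVER). The sketch below shows only the voting idea in a heavily simplified form: it assumes the hypotheses have already been aligned slot by slot to the same length, whereas a full ROVER implementation first builds that alignment with dynamic programming. The `vote` function and the `(word, confidence)` representation are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A hypothesis is a list of (word, confidence) pairs, one per aligned slot.
Hypothesis = List[Tuple[str, float]]


def vote(hypotheses: List[Hypothesis]) -> List[str]:
    """Pick, for each slot, the word with the highest summed confidence."""
    if not hypotheses:
        return []
    length = len(hypotheses[0])
    if any(len(h) != length for h in hypotheses):
        # Simplifying assumption: alignment is done elsewhere.
        raise ValueError("hypotheses must be pre-aligned to the same length")
    combined: List[str] = []
    for slot in zip(*hypotheses):
        scores: Dict[str, float] = defaultdict(float)
        for word, conf in slot:
            scores[word] += conf
        # On a tie, the word seen first (earliest hypothesis) wins.
        combined.append(max(scores, key=scores.get))
    return combined
```

With two hypotheses this reduces to choosing the more confident word in each slot; the summed score becomes more useful when more than two recognizer outputs are available.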

15. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs comprising instructions for:
- receiving, from a network interface, a first speech input;
  
  processing the first speech input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  receiving, from the network interface, a second speech input;
  
  determining whether a phonemic transcription of the second speech input has an error rate that is less than a predetermined value when compared against a phonemic transcription of a corresponding portion of the first speech input;
  
  in response to determining that the phonemic transcription of the second speech input has an error rate that is less than the predetermined value when compared against the phonemic transcription of a corresponding portion of the first speech input, processing the second speech input using a second automatic speech recognition system to produce a second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined based on the second speech recognition result.
- - 16. The non-transitory computer-readable storage medium of claim 15, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system.
  - 17. The non-transitory computer-readable storage medium of claim 15, wherein the one or more programs further including instructions for:
    - determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result, wherein the second user intent is determined further based on the combined speech recognition result.
  - 18. The non-transitory computer-readable storage medium of claim 17, wherein the combined speech recognition result is determined by performing automatic speech recognition system combination using the first speech recognition result and the second speech recognition result.
  - 19. The non-transitory computer-readable storage medium of claim 15, wherein a latency of the second automatic speech recognition system is greater than a latency of the first automatic speech recognition system.
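Claim 15 gates the second recognizer on a comparison of phonemic transcriptions: the second input is processed only if its phoneme error rate against the corresponding portion of the first input is below a predetermined value. One common way to compute such an error rate is edit distance divided by reference length, sketched below. The function names, the ARPAbet-style phoneme symbols in the usage, and the 0.3 threshold are assumptions for illustration; the claim does not specify how the error rate is computed.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def is_repetition(
    first_phonemes: Sequence[str],
    second_phonemes: Sequence[str],
    max_error_rate: float = 0.3,  # hypothetical "predetermined value"
) -> bool:
    """True when the second input sounds close enough to the first."""
    return phoneme_error_rate(first_phonemes, second_phonemes) < max_error_rate
```

In this sketch a `True` result would route the second input to the second recognizer, on the reasoning that a near-identical phoneme sequence suggests the user is repeating the original request rather than issuing a new one.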

20. An electronic device comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving, from a network interface, a first speech input;
  
  processing the first speech input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  receiving, from the network interface, a second speech input;
  
  determining whether a phonemic transcription of the second speech input has an error rate that is less than a predetermined value when compared against a phonemic transcription of a corresponding portion of the first speech input;
  
  in response to determining that the phonemic transcription of the second speech input has an error rate that is less than the predetermined value when compared against the phonemic transcription of a corresponding portion of the first speech input, processing the second speech input using a second automatic speech recognition system to produce a second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined based on the second speech recognition result.
- - 21. The device of claim 20, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system.
  - 22. The device of claim 20, wherein the one or more programs further including instructions for:
    - determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result, wherein the second user intent is determined further based on the combined speech recognition result.
  - 23. The device of claim 22, wherein the combined speech recognition result is determined by performing automatic speech recognition system combination using the first speech recognition result and the second speech recognition result.
  - 24. The device of claim 20, wherein a latency of the second automatic speech recognition system is greater than a latency of the first automatic speech recognition system.

25. A method for processing speech in a digital assistant, the method comprising:
- at an electronic device with a processor and memory storing one or more programs for execution by the processor:
  
  receiving, from a network interface, a first speech input;
  
  processing the first speech input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  receiving, from the network interface, a second speech input;
  
  determining whether a phonemic transcription of the second speech input has an error rate that is less than a predetermined value when compared against a phonemic transcription of a corresponding portion of the first speech input;
  
  in response to determining that the phonemic transcription of the second speech input has an error rate that is less than the predetermined value when compared against the phonemic transcription of a corresponding portion of the first speech input, processing the second speech input using a second automatic speech recognition system to produce a second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined based on the second speech recognition result.
- - 26. The method of claim 25, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system.
  - 27. The method of claim 25, further comprising:
    - determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result, wherein the second user intent is determined further based on the combined speech recognition result.
  - 28. The method of claim 27, wherein the combined speech recognition result is determined by performing automatic speech recognition system combination using the first speech recognition result and the second speech recognition result.
  - 29. The method of claim 25, wherein a latency of the second automatic speech recognition system is greater than a latency of the first automatic speech recognition system.

30. An electronic device, comprising:
- one or more processors;
  
  a memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving, from a network interface, a first speech input;
  
  processing the first speech input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  upon performing the first task, receiving, from the network interface, an input representing a rejection of the first task;
  
  in response to receiving the input, providing a prompt seeking a repetition of at least a portion of the first speech input;
  
  receiving, from the network interface, a second speech input;
  
  in accordance with the received input representing a rejection of the first task, processing the second speech input using a second automatic speech recognition system to produce a second speech recognition result, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system;
  
  determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined from the combined speech recognition result.
- - 31. The electronic device of claim 30, wherein the input is a speech input that includes a predetermined utterance.
  - 32. The electronic device of claim 30, wherein the input comprises a selection of an affordance.
  - 33. The electronic device of claim 30, wherein at least a portion of text of the first speech recognition result is displayed on the electronic device, and wherein the input comprises a selection of at least a portion of the displayed text.
  - 34. The electronic device of claim 30, wherein the one or more programs further comprise instructions for:
    - in accordance with receiving the input, identifying a portion of the first speech input corresponding to a potential error in the first speech recognition result.
  - 35. The electronic device of claim 34, wherein processing the first speech input using the first automatic speech recognition system includes determining a confidence measure of each word in a text of the first speech recognition result, and wherein the portion of the first speech input associated with the potential error is identified based on the confidence measure of each word in the text.

36. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs comprising instructions for:
- receiving, from a network interface, a first speech input;
  
  processing the first speech input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  upon performing the first task, receiving, from the network interface, an input representing a rejection of the first task;
  
  in response to receiving the input, providing a prompt seeking a repetition of at least a portion of the first speech input;
  
  receiving, from the network interface, a second speech input;
  
  in accordance with the received input representing a rejection of the first task, processing the second speech input using a second automatic speech recognition system to produce a second speech recognition result, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system;
  
  determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined from the combined speech recognition result.
- - 37. The computer readable storage medium of claim 36, wherein the input is a speech input that includes a predetermined utterance.
  - 38. The computer readable storage medium of claim 36, wherein the input comprises a selection of an affordance.
  - 39. The computer readable storage medium of claim 36, wherein at least a portion of text of the first speech recognition result is displayed on the electronic device, and wherein the input comprises a selection of at least a portion of the displayed text.
  - 40. The computer readable storage medium of claim 36, wherein the one or more programs further comprise instructions for:
    - in accordance with receiving the input, identifying a portion of the first speech input corresponding to a potential error in the first speech recognition result.

41. An electronic device, comprising:
- one or more processors;
  
  a memory; and
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving an input containing user speech;
  
  processing the input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  upon performing the first task, receiving a second input representing a rejection of the first task;
  
  in response to receiving the second input, processing at least a portion of the audio signal using a second automatic speech recognition system to produce a second speech recognition result, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system;
  
  determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined from the combined speech recognition result.
- - 42. The electronic device of claim 41, wherein an error rate of the second automatic speech recognition system is lower than an error rate of the first automatic speech recognition system.
  - 43. The electronic device of claim 41, wherein a latency of the second automatic speech recognition system is greater than a latency of the first automatic speech recognition system.
  - 44. The electronic device of claim 41, wherein the combined result is determined by performing automatic speech recognition system combination using the first speech recognition result and the second speech recognition result.
  - 45. The electronic device of claim 42, wherein performing automatic speech recognition system combination comprises implementing at least one of recognition output voting error reduction, cross-adaptation, confusion network combination, and lattice combination.

46. A non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of an electronic device, the one or more programs comprising instructions for:
- receiving an input containing user speech;
  
  processing the input using a first automatic speech recognition system to produce a first speech recognition result;
  
  performing a first task corresponding to a first user intent determined from the first speech recognition result;
  
  upon performing the first task, receiving a second input representing a rejection of the first task;
  
  in response to receiving the second input, processing at least a portion of the audio signal using a second automatic speech recognition system to produce a second speech recognition result, wherein the first automatic speech recognition system includes one or more speech recognition models, and the second automatic speech recognition system includes one or more speech recognition models that are different from the one or more speech recognition models of the first automatic speech recognition system;
  
  determining a combined speech recognition result based on the first speech recognition result and the second speech recognition result; and
  
  performing a second task corresponding to a second user intent determined from the combined speech recognition result.
- - 47. The computer readable storage medium of claim 46, wherein an error rate of the second automatic speech recognition system is lower than an error rate of the first automatic speech recognition system.
  - 48. The computer readable storage medium of claim 46, wherein a latency of the second automatic speech recognition system is greater than a latency of the first automatic speech recognition system.
  - 49. The computer readable storage medium of claim 46, wherein the combined result is determined by performing automatic speech recognition system combination using the first speech recognition result and the second speech recognition result.
  - 50. The computer readable storage medium of claim 49, wherein performing automatic speech recognition system combination comprises implementing at least one of recognition output voting error reduction, cross-adaptation, confusion network combination, and lattice combination.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Krishnamoorthy, Mahesh; Paulik, Matthias
Primary Examiner(s): Ky, Kevin

Application Number: US14/591,754
Publication Number: US 20160063998A1
Time in Patent Office: 1,742 Days
Field of Search: None
US Class Current
CPC Class Codes:
- G10L 15/01   Assessment or evaluation of...
- G10L 15/02   Feature extraction for spee...
- G10L 15/22   Procedures used during a sp...
- G10L 15/32   Multiple recognisers used i...
- G10L 2015/025   Phonemes, fenemes or fenone...
