DISAMBIGUATING HETERONYMS IN SPEECH SYNTHESIS

US 20160163312A1
Filed: 12/12/2014
Published: 06/09/2016
Est. Priority Date: 12/09/2014
Status: Active Grant

Abstract

Systems and processes for disambiguating heteronyms in speech synthesis are provided. In one example process, a speech input containing a heteronym can be received from a user. The speech input can be processed using an automatic speech recognition system to determine a phonemic string corresponding to the heteronym as pronounced by the user in the speech input. A correct pronunciation of the heteronym can be determined based on at least one of the phonemic string and an n-gram language model of the automatic speech recognition system. A dialogue response to the speech input can be generated, where the dialogue response can include the heteronym. The dialogue response can be output as a speech output. The heteronym in the dialogue response can be pronounced in the speech output according to the correct pronunciation.
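
As a rough illustration of the first path described in the abstract, the sketch below selects among candidate pronunciations of a heteronym by matching the phonemic string recognized from the user's own speech. The lexicon contents, the ARPAbet-style phoneme strings, and the function name `pronunciation_from_user` are hypothetical and are not taken from the patent; exact string matching is a simplifying assumption.

```python
from typing import Optional

# Hypothetical lexicon: each heteronym maps sense labels to
# ARPAbet-style phoneme strings (illustrative values only).
LEXICON = {
    "bass": {"fish": "B AE S", "music": "B EY S"},
    "lead": {"metal": "L EH D", "guide": "L IY D"},
}


def pronunciation_from_user(word: str, user_phonemes: str) -> Optional[str]:
    """Return the label of the candidate whose phonemes equal the
    phonemic string recognized from the user's speech, or None."""
    for label, phonemes in LEXICON.get(word.lower(), {}).items():
        if phonemes == user_phonemes:
            return label
    return None
```

When no candidate matches (a return value of `None`), a system along these lines would presumably fall back on another signal, such as the n-gram language model mentioned in the abstract.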

25 Claims

1. A method for operating an intelligent automated assistant, the method comprising:
- at an electronic device with a processor and memory storing one or more programs for execution by the processor:
  
  receiving, from a user, a speech input containing a heteronym and one or more additional words;
  
  processing the speech input using an automatic speech recognition system to determine at least one of:
  
  a phonemic string corresponding to the heteronym as pronounced by the user in the speech input; and
  
  a frequency of occurrence of an n-gram with respect to a corpus, wherein the n-gram includes the heteronym and the one or more additional words;
  
  determining a correct pronunciation of the heteronym based on at least one of the phonemic string and the frequency of occurrence of the n-gram;
  
  generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
  
  outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.
- - 2. The method of claim 1, wherein processing the speech input using the automatic speech recognition system includes determining a text string corresponding to the speech input, and further comprising:
    - determining an actionable intent based on the text string, wherein the correct pronunciation of the heteronym is determined based on at least one of the phonemic string, the frequency of occurrence of the n-gram, and the actionable intent.
  - 3. The method of claim 2, further comprising:
    - assigning the heteronym to a parameter of the actionable intent, wherein the correct pronunciation of the heteronym is determined based at least in part on the parameter.
  - 4. The method of claim 2, wherein:
    - a vocabulary list is associated with the actionable intent;
      
      the vocabulary list includes the heteronym;
      
      the heteronym in the vocabulary list is associated with a particular pronunciation; and
      
      the correct pronunciation of the heteronym is determined based on the particular pronunciation associated with the heteronym in the vocabulary list.
  - 5. The method of claim 2, further comprising:
    - receiving contextual information associated with the speech input, wherein the actionable intent is determined based at least in part on the contextual information.
  - 6. The method of claim 1, wherein:
    - the heteronym in the n-gram is associated with a first pronunciation;
      
      processing the speech input using the automatic speech recognition system includes determining a frequency of occurrence of a second n-gram with respect to the corpus;
      
      the second n-gram includes the heteronym and the one or more additional words;
      
      the heteronym in the second n-gram is associated with a second pronunciation; and
      
      the correct pronunciation of the heteronym is determined based on the frequency of occurrence of the n-gram and the frequency of occurrence of the second n-gram.
  - 7. The method of claim 6, wherein the frequency of occurrence of the n-gram is greater than the frequency of occurrence of the second n-gram by at least a predetermined amount, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.
  - 8. The method of claim 6, wherein the frequency of occurrence of the first n-gram is greater than a first predetermined threshold value, wherein the frequency of occurrence of the second n-gram is less than a second predetermined threshold value, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.
  - 9. The method of claim 6, wherein the phonemic string corresponds to the second pronunciation, wherein the frequency of occurrence of the n-gram is greater than the frequency of occurrence of the second n-gram by at least a predetermined amount, and wherein the correct pronunciation of the heteronym is determined to be the first pronunciation.
  - 10. The method of claim 1, further comprising:
    - obtaining from the automatic speech recognition system a second phonemic string corresponding to the determined correct pronunciation, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the speech synthesizer uses the second phonemic string to synthesize the heteronym in the speech output according to the correct pronunciation.
  - 11. The method of claim 1, further comprising:
    - annotating the heteronym in the dialogue response with a tag to identify the correct pronunciation of the heteronym, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the heteronym in the dialogue response is synthesized based on the tag.
  - 12. The method of claim 1, further comprising:
    - receiving contextual information associated with the speech input, wherein the correct pronunciation of the heteronym is determined based at least in part on the contextual information.
  - 13. The method of claim 1, wherein the correct pronunciation of the heteronym is determined based at least in part on a custom pronunciation of the heteronym that is associated with the user, and wherein the custom pronunciation is based on a previous speech input received from the user.
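
The frequency comparisons recited in claims 6 through 9 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of raw counts as "frequency of occurrence", the symmetric handling of the second pronunciation, and the fallback to the user's own pronunciation when the margin is not met are all assumptions.

```python
from typing import Optional


def choose_by_margin(freq_first: int, freq_second: int, margin: int) -> Optional[str]:
    """Claim 7 style: pick the first pronunciation when its n-gram is more
    frequent than the second by at least `margin`. The mirrored branch for
    the second pronunciation is an assumption added for symmetry."""
    if freq_first - freq_second >= margin:
        return "first"
    if freq_second - freq_first >= margin:
        return "second"
    return None  # difference too small to decide


def choose_by_thresholds(freq_first: int, freq_second: int,
                         high: int, low: int) -> Optional[str]:
    """Claim 8 style: pick the first pronunciation when its n-gram frequency
    exceeds `high` and the second n-gram frequency is below `low`."""
    if freq_first > high and freq_second < low:
        return "first"
    return None


def resolve(user_choice: str, freq_first: int, freq_second: int, margin: int) -> str:
    """Claim 9 style: a decisive n-gram margin overrides the pronunciation
    the user actually spoke. Falling back to the user's pronunciation
    otherwise is an assumption."""
    by_ngram = choose_by_margin(freq_first, freq_second, margin)
    return by_ngram if by_ngram is not None else user_choice
```

For example, `resolve("second", 900, 100, 500)` selects the first pronunciation even though the user spoke the second, which mirrors the situation described in claim 9.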

14. A method for operating an intelligent automated assistant, the method comprising:
- at an electronic device with a processor and memory storing one or more programs for execution by the processor:
  
  receiving, from a user, a speech input;
  
  processing the speech input using an automatic speech recognition system to determine a text string corresponding to the speech input;
  
  determining an actionable intent based on the text string;
  
  generating a dialogue response to the speech input based on the actionable intent, wherein the dialogue response includes a heteronym;
  
  determining a correct pronunciation of the heteronym using an n-gram language model of the automatic speech recognition system and based on the heteronym and one or more additional words in the dialogue response; and
  
  outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.
- - 15. The method of claim 14, wherein the one or more additional words precede the heteronym in the dialogue response.
  - 16. The method of claim 14, further comprising:
    - obtaining from the automatic speech recognition system a phonemic string corresponding to the determined correct pronunciation, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the speech synthesizer uses the phonemic string to synthesize the heteronym in the speech output according to the determined correct pronunciation.
  - 17. The method of claim 14, further comprising:
    - annotating the heteronym in the dialogue response with a tag to identify the correct pronunciation of the heteronym, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the heteronym in the dialogue response is synthesized based on the tag.
  - 18. The method of claim 14, further comprising:
    - receiving contextual information associated with the speech input, wherein the correct pronunciation of the heteronym is determined based at least in part on the contextual information.
  - 19. The method of claim 14, wherein the correct pronunciation of the heteronym is determined based at least in part on a custom pronunciation of the heteronym that is associated with the user, and wherein the custom pronunciation is based on a previous speech input received from the user.
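
The tag-based annotation of claim 17 (and, likewise, claim 11) could look something like the sketch below. The claims do not name a tag format; using the SSML `<phoneme>` element with an IPA string is an assumption made here for illustration, as are the function name `annotate_heteronym` and the choice to tag only the first whole-word occurrence.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def annotate_heteronym(response: str, heteronym: str, ipa: str) -> str:
    """Wrap the first whole-word occurrence of `heteronym` in an SSML
    <phoneme> tag carrying the chosen pronunciation. The rest of the
    response is XML-escaped so the result is a valid SSML fragment."""
    pattern = re.compile(r"\b" + re.escape(heteronym) + r"\b")
    match = pattern.search(response)
    if match is None:
        return escape(response)  # nothing to tag
    before = escape(response[:match.start()])
    after = escape(response[match.end():])
    tagged = "<phoneme alphabet=\"ipa\" ph={}>{}</phoneme>".format(
        quoteattr(ipa), escape(match.group())
    )
    return before + tagged + after
```

A synthesizer that understands the tag can then read the pronunciation from the `ph` attribute instead of consulting its default lexicon for that word.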

20. A method for operating an intelligent automated assistant, the method comprising:
- at an electronic device with a processor and memory storing one or more programs for execution by the processor:
  
  receiving, from a user, a speech input containing a heteronym and one or more additional words;
  
  processing the speech input using an automatic speech recognition system to determine a phonemic string corresponding to the heteronym as pronounced by the user in the speech input;
  
  generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
  
  outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the phonemic string.
- - 21. The method of claim 20, wherein the phonemic string is determined using an acoustic model of the automatic speech recognition system.
  - 22. The method of claim 20, wherein outputting the dialogue response includes synthesizing the heteronym in the dialogue response using a speech synthesizer, and wherein the dialogue response is synthesized based on the phonemic string.
  - 23. The method of claim 20, wherein the phonemic string is stored in metadata that is associated with the heteronym in the dialogue response, and wherein the metadata is accessed by the speech synthesizer to synthesize the heteronym in the dialogue response according to the phonemic string.
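
One way to picture the metadata hand-off in claim 23 is sketched below: the phonemic string recognized from the user's speech travels with the heteronym token, and the synthesizer front end prefers it over its default lexicon. The `Token` class, the `"phonemes"` metadata key, the default lexicon, and the ARPAbet-style strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical default lexicon used when a token carries no metadata.
DEFAULT_LEXICON = {"nice": "N AY S"}


@dataclass
class Token:
    text: str
    # Per-token metadata; the "phonemes" key is an assumed convention.
    metadata: Dict[str, str] = field(default_factory=dict)


def phonemes_for(token: Token) -> str:
    """Prefer the phonemic string stored in the token's metadata."""
    if "phonemes" in token.metadata:
        return token.metadata["phonemes"]
    # Placeholder fallback: default lexicon, else the raw text unchanged.
    return DEFAULT_LEXICON.get(token.text.lower(), token.text)


def front_end(tokens: List[Token]) -> List[str]:
    """Produce the per-token pronunciation handed to the synthesizer."""
    return [phonemes_for(t) for t in tokens]
```

With this arrangement, a response token such as `Token("Nice", {"phonemes": "N IY S"})` is rendered with the user's pronunciation, while an unannotated `Token("nice")` falls back to the default entry.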

24. A non-transitory computer-readable storage medium comprising instructions for causing one or more processors to:
- receive, from a user, a speech input containing a heteronym and one or more additional words;
  
  process the speech input using an automatic speech recognition system to determine a text string corresponding to the speech input, wherein processing the speech input includes determining at least one of:
  
  a phonemic string corresponding to the heteronym as pronounced by the user in the speech input; and
  
  a frequency of occurrence of an n-gram with respect to a corpus, wherein the n-gram includes the heteronym and the one or more additional words;
  
  determine an actionable intent based on the text string;
  
  determine a correct pronunciation of the heteronym based on at least one of the phonemic string, the frequency of occurrence of the n-gram, and the actionable intent;
  
  generate a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
  
  output the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

25. An electronic device comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving, from a user, a speech input containing a heteronym and one or more additional words;
  
  processing the speech input using an automatic speech recognition system to determine a text string corresponding to the speech input, wherein processing the speech input includes determining at least one of:
  
  a phonemic string corresponding to the heteronym as pronounced by the user in the speech input; and
  
  a frequency of occurrence of an n-gram with respect to a corpus, wherein the n-gram includes the heteronym and the one or more additional words;
  
  determining an actionable intent based on the text string;
  
  determining a correct pronunciation of the heteronym based on at least one of the phonemic string, the frequency of occurrence of the n-gram, and the actionable intent;
  
  generating a dialogue response to the speech input, wherein the dialogue response includes the heteronym; and
  
  outputting the dialogue response as a speech output, wherein the heteronym in the dialogue response is pronounced in the speech output according to the determined correct pronunciation.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Henton, Caroline; Naik, Devang
Granted Patent: US 9,711,141 B2
CPC Class Codes

G10L 13/08   Text analysis or generation...

G10L 15/22   Procedures used during a sp...

G10L 2015/225   Feedback of the input speech
