Pronunciation guided by automatic speech recognition
Abstract
Speech synthesis chooses pronunciations of words with multiple acceptable pronunciations based on an indication of a personal, class-based, or global preference or an intended non-preferred pronunciation. A speaker's words can be parroted back on personal devices using preferred pronunciations for accent training. Degrees of pronunciation error are computed and indicated to the user in a visual transcription or audibly as word emphasis in parroted speech. Systems can use sets of phonemes extended beyond those generally recognized for a language. Speakers are classified in order to choose specific phonetic dictionaries or adapt global ones. User profiles maintain lists of which pronunciations are preferred among ones acceptable for words with multiple recognized pronunciations. Systems use multiple correlations of word preferences across users to predict user preferences for unlisted words. Speaker-preferred pronunciations are used to weight the scores of transcription hypotheses based on phoneme sequence hypotheses in speech engines.
21 Claims
1. At least one non-transitory computer readable medium storing computer code that, if executed by at least one computer processor within a pronunciation training system, would cause the at least one computer processor to:

receive a spoken utterance from a speaker;
detect an end of the spoken utterance;
perform automatic speech recognition on the spoken utterance to produce a transcription of the words in the spoken utterance;
perform speech synthesis from the transcription to produce synthesized speech according to a phonetic dictionary having mappings to single phoneme sequences representing standard pronunciation; and
responsive to detecting the end of the spoken utterance, output the synthesized speech, wherein the system, in response to detecting the end of the spoken utterance, provides a user with an audible output of the transcription, spoken with the standard pronunciation.

(Dependent claims: 2-9)
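The parrot-back loop recited in claim 1 (detect the end of an utterance, transcribe it, then speak the transcription back with the dictionary's single standard pronunciation per word) can be sketched as follows. The recognizer, synthesizer, and the `<eou>` end-of-utterance marker are illustrative stand-ins, not anything specified by the patent.

```python
# Hedged sketch of claim 1's pipeline. All class and method names are
# hypothetical stubs standing in for real ASR/TTS components.

def parrot_back(audio, recognizer, synthesizer, phonetic_dictionary):
    """On detecting end of utterance: transcribe, then synthesize the
    transcription using one standard phoneme sequence per word."""
    if not recognizer.is_end_of_utterance(audio):
        return None
    words = recognizer.transcribe(audio)                 # ASR step
    phonemes = [phonetic_dictionary[w] for w in words]   # single mapping per word
    return synthesizer.synthesize(phonemes)              # standard-pronunciation audio

class StubRecognizer:
    def is_end_of_utterance(self, audio):
        return audio.endswith("<eou>")                   # toy end-of-utterance marker
    def transcribe(self, audio):
        return audio.replace("<eou>", "").split()

class StubSynthesizer:
    def synthesize(self, phoneme_seqs):
        # Stand-in for audio output: render the phoneme sequences as text.
        return " ".join("-".join(p) for p in phoneme_seqs)

dictionary = {"hello": ("HH", "AH", "L", "OW"), "world": ("W", "ER", "L", "D")}
print(parrot_back("hello world<eou>", StubRecognizer(), StubSynthesizer(), dictionary))
```

Note that output is suppressed (`None`) until the end of the utterance is detected, matching the "responsive to detecting the end" limitation.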
10. A method of determining which of a plurality of pronunciations of a word in a digital phonetic dictionary to use in speech synthesis, the method comprising:

receiving a spoken utterance from a speaker;
recognizing from the spoken utterance, using a speech engine, a word that has multiple known pronunciations;
selecting a preferred pronunciation from the multiple known pronunciations of the word that matches the speaker's pronunciation of the word used in the spoken utterance; and
storing, in a user profile associated with the speaker, an indication of the preferred pronunciation that the speaker used, wherein the speech engine uses the stored preferred pronunciation to select between multiple known pronunciations of the word when synthesizing the speaker's future spoken utterances that include the word.

(Dependent claims: 11-15)
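A minimal sketch of the per-speaker preference store described in claim 10, assuming a toy profile format (word mapped to a phoneme tuple); the names `UserProfile` and `select_pronunciation` are hypothetical, not from the patent.

```python
# Illustrative sketch of claim 10: record the pronunciation a speaker used,
# then prefer it when synthesizing that speaker's future utterances.

class UserProfile:
    """Per-speaker store mapping a word to the phoneme sequence the speaker used."""
    def __init__(self):
        self.preferred = {}  # word -> phoneme tuple

    def record(self, word, phonemes):
        self.preferred[word] = tuple(phonemes)

def select_pronunciation(profile, word, dictionary):
    """Use the speaker's stored preference if present, else the dictionary default."""
    options = dictionary[word]  # list of candidate phoneme tuples
    return profile.preferred.get(word, options[0])

dictionary = {
    "tomato": [("T", "AH", "M", "EY", "T", "OW"),
               ("T", "AH", "M", "AA", "T", "OW")],
}
profile = UserProfile()
profile.record("tomato", ("T", "AH", "M", "AA", "T", "OW"))  # speaker's variant
print(select_pronunciation(profile, "tomato", dictionary))
```

A fresh profile with no stored entry falls back to the dictionary's first listed pronunciation.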
16. A method of determining which of a plurality of pronunciations of a word in a phonetic dictionary to use in speech synthesis, the method comprising:

determining a text word to synthesize, the text word having multiple known pronunciations;
looking up, in a user profile, a preferred pronunciation of a profile word that has multiple known pronunciations;
calculating a correlation between the preferred pronunciation of the profile word and each of multiple pronunciations of the text word; and
choosing one of the multiple pronunciations of the text word at least partially based on its correlation to the preferred pronunciation of the profile word.

(Dependent claim: 17)
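One plausible reading of the correlation step in claim 16, using simple positionwise phoneme agreement as the correlation measure; the patent does not specify the metric, so this scoring is an assumption. The idea is that a speaker who prefers the "AA" variant of one word likely prefers the "AA" variant of a similar word.

```python
# Hypothetical correlation between a candidate pronunciation of the word to
# synthesize and the speaker's stored pronunciation of a related profile word.

def correlation(candidate, profile_pron):
    """Fraction of aligned positions where the phonemes agree (assumed metric)."""
    n = min(len(candidate), len(profile_pron))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(candidate, profile_pron)) / n

def choose_pronunciation(candidates, profile_pron):
    """Pick the candidate most correlated with the profile-word preference."""
    return max(candidates, key=lambda c: correlation(c, profile_pron))

profile_pron = ("T", "AH", "M", "AA", "T", "OW")   # speaker prefers "tom-AH-to"
candidates = [("P", "AH", "T", "EY", "T", "OW"),   # "po-TAY-to"
              ("P", "AH", "T", "AA", "T", "OW")]   # "po-TAH-to"
print(choose_pronunciation(candidates, profile_pron))
```

Here the "AA" candidate scores 4/6 against the profile pronunciation versus 3/6 for the "EY" candidate, so the speaker's vowel preference carries over to the unlisted word.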
18. A method of configuring a phonetic dictionary, the method comprising:

analyzing, across a multiplicity of users, the users' profile word lists for a plurality of pronunciations of at least one word recognized with multiple pronunciations;
determining, based on the analysis, that a pronunciation most frequently used across the multiplicity of users is a preferred one of the plurality of pronunciations; and
updating the phonetic dictionary to indicate the preferred one of the plurality of pronunciations for speech recognition.

(Dependent claims: 19, 20)
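The crowd-frequency analysis in claim 18 reduces to a vote count across user profiles. A sketch under an assumed profile format (one dict per user, word mapped to a phoneme tuple):

```python
from collections import Counter

# Illustrative sketch of claim 18: the pronunciation stored most often across
# user profiles becomes the dictionary's preferred one. Profile format assumed.

def crowdsourced_preference(profiles, word):
    """Return the pronunciation of `word` most frequently stored across users,
    or None if no profile lists the word."""
    counts = Counter(p[word] for p in profiles if word in p)
    if not counts:
        return None
    pron, _ = counts.most_common(1)[0]
    return pron

profiles = [
    {"either": ("IY", "DH", "ER")},
    {"either": ("AY", "DH", "ER")},
    {"either": ("IY", "DH", "ER")},
]
print(crowdsourced_preference(profiles, "either"))  # the majority variant
```

The dictionary update step would then mark the returned pronunciation as preferred for speech recognition.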
21. A method of improving the accuracy of automatic speech recognition, the method comprising:

determining, by a speech engine, a multiplicity of phoneme sequence hypotheses from a spoken utterance;
determining, by the speech engine, a multiplicity of transcription hypotheses, each transcription hypothesis being based on: a match between the phoneme sequence hypothesis and a pronunciation of a word in a phonetic dictionary, the word having a plurality of pronunciations; and an indication of at least one pronunciation being preferred, wherein the preferred pronunciation is determined by frequency of usage determined by crowdsourcing; and
calculating a likelihood score for each transcription hypothesis, the likelihood score being positively correlated to the matched pronunciation being the preferred pronunciation of the word.
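The scoring rule in claim 21 can be sketched as a multiplicative boost applied to any transcription hypothesis whose matched pronunciation is the crowd-preferred one. The hypothesis format and the boost factor are assumptions; the claim only requires the score to be positively correlated with matching the preferred pronunciation.

```python
# Hedged sketch of claim 21's likelihood weighting. Each hypothesis is assumed
# to be (transcript, matched_pronunciation, base_likelihood); boost is arbitrary.

def best_hypothesis(hypotheses, preferred, boost=1.2):
    """Rescore hypotheses, boosting those matching the preferred pronunciation,
    and return the (transcript, score) pair with the highest score."""
    scored = []
    for transcript, pron, likelihood in hypotheses:
        weight = boost if pron == preferred else 1.0
        scored.append((transcript, likelihood * weight))
    return max(scored, key=lambda t: t[1])

hypotheses = [
    ("bate her", ("B", "EY", "T", "ER"), 0.50),
    ("beta",     ("B", "EY", "T", "AH"), 0.45),
]
# Crowdsourcing says the speaker population prefers the "beta" pronunciation:
print(best_hypothesis(hypotheses, preferred=("B", "EY", "T", "AH")))
```

With the boost, the lower-likelihood "beta" hypothesis (0.45 x 1.2 = 0.54) overtakes "bate her" (0.50); with no preference indicated, the base likelihood ranking stands.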