Learning personalized entity pronunciations

US 10,152,965 B2
Filed: 02/03/2016
Issued: 12/11/2018
Est. Priority Date: 02/03/2016
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for implementing a pronunciation dictionary that stores entity name pronunciations. In one aspect, a method includes actions of receiving audio data corresponding to an utterance that includes a command and an entity name. Additional actions may include generating, by an automated speech recognizer, an initial transcription for a portion of the audio data that is associated with the entity name, receiving a corrected transcription for the portion of the utterance that is associated with the entity name, obtaining a phonetic pronunciation that is associated with the portion of the audio data that is associated with the entity name, updating a pronunciation dictionary to associate the phonetic pronunciation with the entity name, receiving a subsequent utterance that includes the entity name, and transcribing the subsequent utterance based at least in part on the updated pronunciation dictionary.
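The learn-from-correction loop summarized in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the patent: the `PronunciationDictionary` class, the `transcribe_entity` helper, the space-separated phoneme strings, and the contact name. It models only the dictionary update and the later lookup, not speech recognition itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PronunciationDictionary:
    """Maps each entity name to the set of pronunciations known for it."""

    entries: Dict[str, Set[str]] = field(default_factory=dict)

    def add(self, entity_name: str, pronunciation: str) -> None:
        """Associate a pronunciation with an entity name."""
        self.entries.setdefault(entity_name, set()).add(pronunciation)

    def lookup(self, pronunciation: str) -> List[str]:
        """Return, in sorted order, the entity names stored with this pronunciation."""
        return sorted(
            name for name, known in self.entries.items() if pronunciation in known
        )


def transcribe_entity(
    dictionary: PronunciationDictionary, pronunciation: str, fallback: str
) -> str:
    """Prefer a learned entity name; otherwise keep the recognizer's own guess."""
    matches = dictionary.lookup(pronunciation)
    return matches[0] if matches else fallback


# Hypothetical data: the phoneme strings and names are invented for illustration.
contacts = PronunciationDictionary()
contacts.add("Xochitl", "z ow ch ih t l")  # previously available pronunciation
heard = "s ow ch ee l"                     # how this user actually says the name

first_pass = transcribe_entity(contacts, heard, fallback="so chill")
contacts.add("Xochitl", heard)             # the user corrected the transcription
second_pass = transcribe_entity(contacts, heard, fallback="so chill")
```

In this sketch the first pass falls back to the recognizer's guess because the user's pronunciation is not yet stored; after the correction is recorded, the same pronunciation resolves to the contact name.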

18 Claims

1. A method comprising:
- receiving audio data corresponding to an utterance that is spoken by a user of a device and that includes a voice command trigger term and an entity name that is a proper noun;
  
  generating, by an automated speech recognizer, a first phonetic representation of a first portion of the utterance that is associated with the entity name that is a proper noun, wherein the first phonetic pronunciation does not phonetically correspond to a previously available phonetic pronunciation of the entity name;
  
  generating, by the automated speech recognizer, an initial transcription that (i) is based on the first phonetic representation of the first portion of the utterance, and (ii) includes a transcription of a term that is not a proper noun;
  
  in response to the generation of the initial transcription that includes a transcription of the term that is not a proper noun, prompting a user for feedback, wherein prompting the user for feedback comprises:
  
  providing, for output to the user on a graphical user interface of the device, a representation of the initial transcription that (i) is based on the first phonetic pronunciation of the first portion of the utterance, and (ii) includes the transcription of the term that is not a proper noun;
  
  providing, for output to the user on the graphical user interface, multiple entity names from a set of entity names stored in the pronunciation dictionary, wherein the multiple entity names that are provided for output on the graphical user interface include both (i) entity names that are phonetically close to the entity name included in the utterance, and (ii) entity names that are phonetically unrelated to the entity name included in the utterance; and
  
  receiving data corresponding to a selection by the user of a particular entity name of the multiple entity names;
  
  generating a different transcription based on the received data corresponding to the particular entity name selected by the user, wherein the different transcription includes an entity name that does not phonetically correspond to the first phonetic representation;
  
  updating the pronunciation dictionary to associate (i) the first phonetic representation of the first portion of the utterance that corresponds to the portion of the utterance that is associated with the entity name that is a proper noun with (ii) the entity name in the pronunciation dictionary corresponding to the different transcription that does not phonetically correspond to the first phonetic representation;
  
  receiving a subsequent utterance that includes the entity name; and
  
  transcribing the subsequent utterance based at least in part on the updated pronunciation dictionary.

2. The method of claim 1, wherein updating a pronunciation dictionary further comprises:
- identifying a pronunciation dictionary entry that is associated with the entity name;

  deleting the portion of the entry that corresponds to a phonetic representation of the initial transcription; and

  storing, in the pronunciation dictionary entry that is associated with the entity name, the phonetic representation that is associated with the first phonetic representation.

3. The method of claim 1, further comprising:
- associating a time stamp with at least a portion of the received audio data that is associated with the first portion of the utterance; and

  caching one or more portions of the received audio data until the different transcription of the utterance is identified and the command associated with the received utterance is completed.

4. The method of claim 3, further comprising:
- identifying a most recently received audio data based on the timestamp; and

  generating a phonetic representation of the first portion of the utterance that is represented by the obtained portion of the most recently received audio data based on a set of phonemes obtained using an acoustic model.

5. The method of claim 1, further comprising:
- in response to updating a pronunciation dictionary to include the first phonetic representation, increasing a global counter associated with the first phonetic representation.

6. The method of claim 5, further comprising:
- determining that the global counter associated with the first phonetic representation satisfies a predetermined threshold; and

  in response to determining that the global counter associated with the first phonetic pronunciation has exceeded a predetermined threshold, updating a pronunciation dictionary entry in a global pronunciation dictionary that is associated with the entity name to include the first phonetic representation associated with the different transcription.
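The counter-and-threshold behavior recited in claims 5 and 6 might be sketched as follows. This is a hedged illustration, not the patented implementation: the `GlobalPronunciationAggregator` class and its method names are hypothetical, and because the claim language uses both "satisfies" and "has exceeded", the sketch assumes a count that reaches the threshold is enough.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Set, Tuple


class GlobalPronunciationAggregator:
    """Counts personal dictionary updates and promotes frequent pronunciations."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        # (entity name, pronunciation) -> number of personal dictionary updates seen
        self.counts: Counter = Counter()
        # Shared dictionary: entity name -> promoted pronunciations
        self.global_dictionary: DefaultDict[str, Set[str]] = defaultdict(set)

    def record_personal_update(self, entity_name: str, pronunciation: str) -> bool:
        """Increase the counter; return True once the pronunciation is global."""
        key: Tuple[str, str] = (entity_name, pronunciation)
        self.counts[key] += 1
        # Assumption: reaching the threshold counts as satisfying it.
        if self.counts[key] >= self.threshold:
            self.global_dictionary[entity_name].add(pronunciation)
            return True
        return False


# Hypothetical usage: three users teach their devices the same pronunciation.
aggregator = GlobalPronunciationAggregator(threshold=3)
results = [
    aggregator.record_personal_update("Xochitl", "s ow ch ee l") for _ in range(3)
]
```

Keeping the counter keyed on the pair of entity name and pronunciation means an unusual pronunciation learned by a single user stays in that user's personal dictionary and never reaches the shared one.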

7. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving audio data corresponding to an utterance that is spoken by a user of a device and that includes a voice command trigger term and an entity name that is a proper noun;
  
  generating, by an automated speech recognizer, a first phonetic representation of a first portion of the utterance that is associated with the entity name that is a proper noun, wherein the first phonetic pronunciation does not phonetically correspond to a previously available phonetic pronunciation of the entity name;
  
  generating, by the automated speech recognizer, an initial transcription that (i) is based on the first phonetic representation of the first portion of the utterance, and (ii) includes a transcription of a term that is not a proper noun;
  
  in response to the generation of the initial transcription that includes a transcription of the term that is not a proper noun, prompting a user for feedback, wherein prompting the user for feedback comprises:
  
  providing, for output to the user on a graphical user interface of the device, a representation of the initial transcription that (i) is based on the first phonetic pronunciation of the first portion of the utterance, and (ii) includes the transcription of the term that is not a proper noun;
  
  providing, for output to the user on the graphical user interface, multiple entity names from a set of entity names stored in the pronunciation dictionary, wherein the multiple entity names that are provided for output on the graphical user interface include both (i) entity names that are phonetically close to the entity name included in the utterance, and (ii) entity names that are phonetically unrelated to the entity name included in the utterance; and
  
  receiving data corresponding to a selection by the user of a particular entity name of the multiple entity names;
  
  generating a different transcription based on the received data corresponding to the particular entity name selected by the user, wherein the different transcription includes an entity name that does not phonetically correspond to the first phonetic representation;
  
  updating the pronunciation dictionary to associate (i) the first phonetic representation of the first portion of the utterance that corresponds to the portion of the utterance that is associated with the entity name that is a proper noun with (ii) the entity name in the pronunciation dictionary corresponding to the different transcription that does not phonetically correspond to the first phonetic representation;
  
  receiving a subsequent utterance that includes the entity name; and
  
  transcribing the subsequent utterance based at least in part on the updated pronunciation dictionary.

8. The system of claim 7, wherein updating a pronunciation dictionary further comprises:
- identifying a pronunciation dictionary entry that is associated with the entity name;

  deleting the portion of the entry that corresponds to a phonetic representation of the initial transcription; and

  storing, in the pronunciation dictionary entry that is associated with the entity name, the phonetic representation that is associated with the first phonetic representation.

9. The system of claim 7, wherein the operations further comprise:
- associating a time stamp with at least a portion of the received audio data that is associated with the first portion of the utterance; and

  caching one or more portions of the received audio data until the different transcription of the utterance is identified and the command associated with the received utterance is completed.

10. The system of claim 7, wherein the operations further comprise:
- identifying a most recently received audio data based on the timestamp; and

  generating a phonetic representation of the first portion of the utterance that is represented by the obtained portion of the most recently received audio data based on a set of phonemes obtained using an acoustic model.

11. The system of claim 7, wherein the operations further comprise:
- in response to updating a pronunciation dictionary to include the first phonetic representation, increasing a global counter associated with the first phonetic representation.

12. The system of claim 11, wherein the operations further comprise:
- determining that the global counter associated with the first phonetic representation satisfies a predetermined threshold; and

  in response to determining that the global counter associated with the first phonetic pronunciation has exceeded a predetermined threshold, updating a pronunciation dictionary entry in a global pronunciation dictionary that is associated with the entity name to include the first phonetic representation associated with the different transcription.
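The time-stamped caching of audio described in claims 9 and 10 could look roughly like the sketch below. The `AudioCache` and `CachedAudio` names are hypothetical, audio is represented as raw `bytes` purely for illustration, and the acoustic-model step that turns the selected audio into phonemes is out of scope here.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CachedAudio:
    timestamp: float  # when this portion of the utterance was received
    samples: bytes    # placeholder for the audio portion tied to the entity name


class AudioCache:
    """Holds audio portions until the corrected transcription is resolved."""

    def __init__(self) -> None:
        self._items: List[CachedAudio] = []

    def store(self, samples: bytes, timestamp: Optional[float] = None) -> CachedAudio:
        """Cache an audio portion, stamping it with the current time by default."""
        stamp = time.time() if timestamp is None else timestamp
        item = CachedAudio(timestamp=stamp, samples=samples)
        self._items.append(item)
        return item

    def most_recent(self) -> Optional[CachedAudio]:
        """Return the cached portion with the latest time stamp, if any."""
        return max(self._items, key=lambda item: item.timestamp, default=None)

    def release(self) -> None:
        """Drop cached audio once the command has been completed."""
        self._items.clear()


# Hypothetical usage with explicit time stamps so the outcome is deterministic.
cache = AudioCache()
cache.store(b"first", timestamp=1.0)
cache.store(b"latest", timestamp=3.0)
cache.store(b"middle", timestamp=2.0)
newest = cache.most_recent()
```

Selecting by time stamp rather than by insertion order is a design choice of this sketch; it keeps the lookup correct even if audio portions are cached out of order.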

13. A non-transitory computer-readable medium storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- receiving audio data corresponding to an utterance that is spoken by a user of a device and that includes a voice command trigger term and an entity name that is a proper noun;
  
  generating, by an automated speech recognizer, a first phonetic representation of a first portion of the utterance that is associated with the entity name that is a proper noun, wherein the first phonetic pronunciation does not phonetically correspond to a previously available phonetic pronunciation of the entity name;
  
  generating, by the automated speech recognizer, an initial transcription that (i) is based on the first phonetic representation of the first portion of the utterance, and (ii) includes a transcription of a term that is not a proper noun;
  
  in response to the generation of the initial transcription that includes a transcription of the term that is not a proper noun, prompting a user for feedback, wherein prompting the user for feedback comprises:
  
  providing, for output to the user on a graphical user interface of the device, a representation of the initial transcription that (i) is based on the first phonetic pronunciation of the first portion of the utterance, and (ii) includes the transcription of the term that is not a proper noun;
  
  providing, for output to the user on the graphical user interface, multiple entity names from a set of entity names stored in the pronunciation dictionary, wherein the multiple entity names that are provided for output on the graphical user interface include both (i) entity names that are phonetically close to the entity name included in the utterance, and (ii) entity names that are phonetically unrelated to the entity name included in the utterance; and
  
  receiving data corresponding to a selection by the user of a particular entity name of the multiple entity names;
  
  generating a different transcription based on the received data corresponding to the particular entity name selected by the user, wherein the different transcription includes an entity name that does not phonetically correspond to the first phonetic representation;
  
  updating the pronunciation dictionary to associate (i) the first phonetic representation of the first portion of the utterance that corresponds to the portion of the utterance that is associated with the entity name that is a proper noun with (ii) the entity name in the pronunciation dictionary corresponding to the different transcription that does not phonetically correspond to the first phonetic representation;
  
  receiving a subsequent utterance that includes the entity name; and
  
  transcribing the subsequent utterance based at least in part on the updated pronunciation dictionary.

14. The computer-readable medium of claim 13, wherein updating a pronunciation dictionary further comprises:
- identifying a pronunciation dictionary entry that is associated with the entity name;

  deleting the portion of the entry that corresponds to a phonetic representation of the initial transcription; and

  storing, in the pronunciation dictionary entry that is associated with the entity name, the phonetic representation that is associated with the first phonetic representation.

15. The computer-readable medium of claim 13, wherein the operations further comprise:
- associating a time stamp with at least a portion of the received audio data that is associated with the first portion of the utterance; and

  caching one or more portions of the received audio data until the different transcription of the utterance is identified and the command associated with the received utterance is completed.

16. The computer-readable medium of claim 13, wherein obtaining a phonetic representation that is associated with the manually selected term comprises:
- identifying a most recently received audio data based on the timestamp; and

  generating a phonetic representation of the first portion of the utterance that is represented by the obtained portion of the most recently received audio data based on a set of phonemes obtained using an acoustic model.

17. The computer-readable medium of claim 13, wherein the operations further comprise:
- in response to updating a pronunciation dictionary to include the first phonetic representation, increasing a global counter associated with the first phonetic representation.

18. The computer-readable medium of claim 13, wherein the operations further comprise:
- determining that the global counter associated with the first phonetic representation satisfies a predetermined threshold; and

  in response to determining that the global counter associated with the first phonetic pronunciation has exceeded a predetermined threshold, updating a pronunciation dictionary entry in a global pronunciation dictionary that is associated with the entity name to include the first phonetic representation associated with the different transcription.
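The identify, delete, and store sequence of claims 2, 8, and 14 can be read as an in-place replacement within one dictionary entry. The sketch below is one possible reading under stated assumptions: the `replace_pronunciation` function, the list-based entry layout, and the phoneme strings are all hypothetical, and the "portion of the entry" being deleted is taken to be a single stored pronunciation.

```python
from typing import Dict, List


def replace_pronunciation(
    dictionary: Dict[str, List[str]],
    entity_name: str,
    stale: str,
    learned: str,
) -> List[str]:
    """Swap a stale pronunciation for a learned one in an entity's entry."""
    # Identify (or create) the entry associated with the entity name.
    entry = dictionary.setdefault(entity_name, [])
    # Delete the portion of the entry that is being superseded, if present.
    if stale in entry:
        entry.remove(stale)
    # Store the learned pronunciation, avoiding duplicates.
    if learned not in entry:
        entry.append(learned)
    return entry


# Hypothetical usage: the entry and phoneme strings are invented for illustration.
personal_dictionary = {"Xochitl": ["z ow ch ih t l"]}
updated_entry = replace_pronunciation(
    personal_dictionary,
    "Xochitl",
    stale="z ow ch ih t l",
    learned="s ow ch ee l",
)
```

Calling the function a second time with the same arguments leaves the entry unchanged, so repeated corrections of the same name do not accumulate duplicate pronunciations.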

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Bruguier, Antoine Jean; Peng, Fuchun; Beaufays, Francoise
Primary Examiner(s): Sirjani, Fariba
Application Number: US15/014,213
Publication Number: US 20170221475A1
Time in Patent Office: 1,042 Days
Field of Search: None

CPC Class Codes

G10L 15/063   Training
G10L 15/065   Adaptation
G10L 15/26   Speech to text systems G10L...
G10L 2015/0635   updating or merging of old ...
G10L 2015/0636   Threshold criteria for the ...