PREDICTING PRONUNCIATION IN SPEECH RECOGNITION

US 20150255069A1
Filed: 03/04/2014
Published: 09/10/2015
Est. Priority Date: 03/04/2014
Status: Active Grant

Abstract

An automatic speech recognition (ASR) device may be configured to predict pronunciations of textual identifiers (for example, song names) by predicting one or more languages of origin of each textual identifier. The languages of origin may be determined from the textual identifier itself. The predicted pronunciations may include a pronunciation in one language, a pronunciation in a second language, and a hybrid pronunciation that combines multiple languages. The pronunciations may be added to a lexicon and matched to the content item (e.g., a song) and/or the textual identifier. The ASR device may receive a spoken utterance from a user requesting access to the content item. The ASR device determines whether the spoken utterance matches one of the pronunciations stored in the lexicon for the content item, and accesses the content when it does.
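
The lexicon step described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the implementation described in the patent: the class name `PronunciationLexicon`, the use of phoneme tuples as keys, the phoneme symbols, and the scores are all hypothetical.

```python
from collections import defaultdict


class PronunciationLexicon:
    """Maps phoneme sequences to the textual identifiers they may refer to."""

    def __init__(self):
        # pronunciation (tuple of phoneme symbols) -> list of (score, identifier)
        self._entries = defaultdict(list)

    def add(self, identifier, pronunciation, score):
        """Store an association between one pronunciation and an identifier."""
        self._entries[tuple(pronunciation)].append((score, identifier))

    def lookup(self, pronunciation):
        """Return the best-scoring identifier for a pronunciation, or None."""
        candidates = self._entries.get(tuple(pronunciation))
        if not candidates:
            return None
        _, identifier = max(candidates, key=lambda entry: entry[0])
        return identifier


# Hypothetical entries: a German-style and an English-style pronunciation
# of the same song title, each with an illustrative score.
lexicon = PronunciationLexicon()
lexicon.add("Für Elise", ["f", "y", "r", "e", "l", "i", "z", "ə"], 0.6)
lexicon.add("Für Elise", ["f", "ɝ", "ɛ", "l", "i", "s"], 0.4)
```

Because several pronunciations point at the same identifier, a request spoken with either pronunciation resolves to the same content item, while an unknown phoneme sequence yields no match.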

26 Claims

1. A computer-implemented method for processing a spoken utterance, the method comprising:
- determining at least one language of origin of a song title based at least in part on a spelling of the song title;
  
  determining a plurality of potential pronunciations of the song title based at least in part on the at least one language of origin and a language spoken by a user, wherein each of the plurality of potential pronunciations is associated with a score;
  
  storing an association between each of the plurality of potential pronunciations and the song title;
  
  receiving a spoken utterance comprising a request to play a song;
  
  matching a portion of the spoken utterance with one of the plurality of potential pronunciations based at least in part on a score of the one of the plurality of potential pronunciations;
  
  identifying the song based at least in part on the one of the plurality of potential pronunciations; and
  
  causing the song to be played on a computing device.
- Dependent Claims (2, 3, 4)
  - 2. The method of claim 1, in which determining the plurality of potential pronunciations is further based at least in part on a user pronunciation history of a word with at least one language of origin in common with the song title.
  - 3. The method of claim 1, further comprising determining at least one potential pronunciation by associating a first language of origin with one portion of the song title and a second language of origin with a second portion of the song title.
  - 4. The method of claim 1, in which determining the at least one language of origin of the song title is based at least in part on a language of origin of other songs capable of being played by the computing device.
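
Claims 1 and 3 together describe scored pronunciations in which different portions of a title may be assigned different languages of origin. One way to picture this is to enumerate every per-word language assignment and multiply the per-word scores. The sketch below assumes exactly that; the table `WORD_PRONS`, its phoneme strings, its scores, and the product-of-scores rule are hypothetical and are not taken from the claims.

```python
import math
from itertools import product

# Hypothetical per-word pronunciations: word -> language -> (phonemes, score).
WORD_PRONS = {
    "casa": {"es": ("k a s a", 0.8), "en": ("k ae s ah", 0.2)},
    "blues": {"en": ("b l uw z", 0.9), "es": ("b l u e s", 0.1)},
}


def hybrid_pronunciations(title):
    """Return (phonemes, languages, score) for every per-word language choice."""
    per_word = [WORD_PRONS[word].items() for word in title.lower().split()]
    results = []
    for combo in product(*per_word):
        # Each element of combo is (language, (phonemes, score)) for one word.
        langs = tuple(lang for lang, _ in combo)
        phonemes = " ".join(pron for _, (pron, _) in combo)
        score = math.prod(s for _, (_, s) in combo)
        results.append((phonemes, langs, score))
    # Highest-scoring pronunciation first.
    return sorted(results, key=lambda r: r[2], reverse=True)


ranked = hybrid_pronunciations("Casa Blues")
```

With these toy numbers the top candidate pairs a Spanish reading of the first word with an English reading of the second, which is the kind of hybrid pronunciation claim 3 refers to.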

5. A computing system, comprising:
- at least one processor;
  
  a memory device including instructions operable to be executed by the at least one processor to perform a set of actions, the instructions configuring the at least one processor;
  
  to determine a potential language of origin for a textual identifier, wherein the potential language of origin is based at least in part on textual identifier;
  
  to determine a potential pronunciation of the textual identifier, wherein the potential pronunciation is based at least in part on the potential language of origin and a potential spoken language; and
  
  to store an association between the potential pronunciation and the textual identifier.
- Dependent Claims (6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
  - 6. The computing system of claim 5, wherein the instructions further configure the at least one processor:
    - to determine a second potential language of origin for the textual identifier, wherein the second potential language of origin is based at least in part on the textual identifier;
      
      to determine a second potential pronunciation of the textual identifier, wherein the second potential pronunciation is based at least in part on the second potential language of origin; and
      
      to store an association between the second potential pronunciation and the textual identifier.
  - 7. The computing system of claim 6, wherein the potential language of origin, second potential language of origin, potential pronunciation and second potential pronunciation are each associated with a respective score.
  - 8. The computing system of claim 5, wherein the at least one processor is further configured to determine a second potential language of origin of the textual identifier, and wherein:
    - the potential language of origin is associated with a first portion of the textual identifier,
      
      the second potential language of origin is associated with a second portion of the textual identifier, and
      
      the potential pronunciation is further based at least in part on the second potential language of origin.
  - 9. The computing system of claim 5, wherein the at least one processor is further configured to determine the potential pronunciation further based at least in part on a pronunciation history of a user.
  - 10. The computing system of claim 9, wherein the pronunciation history of a user comprises a language spoken by the user.
  - 11. The computing system of claim 5, wherein the at least one processor is further configured to determine the potential language of origin further based at least in part on a language of origin of a second textual identifier associated with the textual identifier.
  - 12. The computing system of claim 5, wherein the instructions further configure the at least one processor:
    - to receive audio data comprising an utterance;
      
      to identify the potential pronunciation in the utterance;
      
      to identify the textual identifier based on the stored association; and
      
      to retrieve at least a portion of a content item associated with the textual identifier.
  - 13. The computing system of claim 5, wherein the textual identifier comprises a name of an artist, album, band, movie, book, song and/or food item to be accessed by the computing device.
  - 14. The computing system of claim 5, wherein the potential spoken language comprises a language associated with a location of a device of the system.
  - 15. The computing system of claim 5, wherein the at least one processor is further configured to determine the potential pronunciation of the textual identifier using at least one of a finite state transducer (FST) model, a maximum entropy model, a character level language model and/or a conditional random fields model.
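
Claim 15 names a character level language model among the models that may be used. As a rough illustration of what such a model looks like, the sketch below trains character bigram models on tiny word lists and uses them to score candidate languages of origin from spelling alone. Applying the model to language identification (rather than to pronunciation, as the claim words it), the add-one smoothing, the `vocab_size` constant, and the training words are all assumptions made for this example.

```python
import math
from collections import Counter


def train_char_bigram(words):
    """Count character bigrams, with ^ and $ marking word boundaries."""
    counts, context = Counter(), Counter()
    for word in words:
        padded = f"^{word.lower()}$"
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context


def log_prob(word, model, vocab_size=64):
    """Add-one smoothed log-probability of a word under one bigram model."""
    counts, context = model
    padded = f"^{word.lower()}$"
    return sum(
        math.log((counts[(a, b)] + 1) / (context[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def language_scores(word, models):
    """Normalize per-language likelihoods into scores that sum to 1."""
    logs = {lang: log_prob(word, m) for lang, m in models.items()}
    top = max(logs.values())
    exps = {lang: math.exp(lp - top) for lang, lp in logs.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


# Toy training data; a real system would use far larger word lists.
models = {
    "en": train_char_bigram(
        ["the", "night", "love", "through", "you", "heart", "where", "light"]),
    "de": train_char_bigram(
        ["nacht", "liebe", "schön", "ich", "herz", "über", "licht", "strasse"]),
}
scores = language_scores("nachtlicht", models)
```

Each language receives a score rather than a hard decision, which fits the claims' notion of potential languages of origin that carry respective scores.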

16. A non-transitory computer-readable storage medium storing processor-executable instructions for controlling a computing device, comprising:
- program code to determine a potential language of origin for a textual identifier, wherein the potential language of origin is based at least in part on textual identifier;
  
  program code to determine a potential pronunciation of the textual identifier, wherein the potential pronunciation is based at least in part on the potential language of origin and a potential spoken language; and
  
  program code to store an association between the potential pronunciation and the textual identifier.
- Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24, 25, 26)
  - 17. The non-transitory computer-readable storage medium of claim 16, further comprising:
    - program code to determine a second potential language of origin for the textual identifier, wherein the second potential language of origin is based at least in part on the textual identifier;
      
      program code to determine a second potential pronunciation of the textual identifier, wherein the second potential pronunciation is based at least in part on the second potential language of origin; and
      
      program code to store an association between the second potential pronunciation and the textual identifier.
  - 18. The non-transitory computer-readable storage medium of claim 17, wherein the potential language of origin, second potential language of origin, potential pronunciation and second potential pronunciation are each associated with a respective score.
  - 19. The non-transitory computer-readable storage medium of claim 16, further comprising program code to determine a second potential language of origin of the textual identifier, and wherein:
    - the potential language of origin is associated with a first portion of the textual identifier,
      
      the second potential language of origin is associated with a second portion of the textual identifier, and
      
      the potential pronunciation is further based at least in part on the second potential language of origin.
  - 20. The non-transitory computer-readable storage medium of claim 16, further comprising program code to determine the potential pronunciation further based at least in part on a pronunciation history of a user.
  - 21. The non-transitory computer-readable storage medium of claim 20, wherein the pronunciation history of a user comprises a language spoken by the user.
  - 22. The non-transitory computer-readable storage medium of claim 16, further comprising program code to determine the potential language of origin further based at least in part on a language of origin of a second textual identifier associated with the textual identifier.
  - 23. The non-transitory computer-readable storage medium of claim 16, further comprising:
    - program code to receive audio data comprising an utterance;
      
      program code to identify the potential pronunciation in the utterance;
      
      program code to identify the textual identifier based on the stored association; and
      
      program code to retrieve at least a portion of a content item associated with the textual identifier.
  - 24. The non-transitory computer-readable storage medium of claim 16, wherein the textual identifier comprises a name of an artist, album, band, movie, book, song and/or food item to be accessed by the computing device.
  - 25. The non-transitory computer-readable storage medium of claim 16, wherein the potential spoken language comprises a language associated with a location of a device of the system.
  - 26. The non-transitory computer-readable storage medium of claim 16, wherein the program code to determine the potential pronunciation of the textual identifier is based at least in part on a finite state transducer (FST) model, a maximum entropy model, a character level language model and/or a conditional random fields model.
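
Claims 5 and 16 base the potential pronunciation on both the language of origin and a potential spoken language. One simple way to model that interaction is to substitute phonemes that the speaker's language lacks with nearby ones, keeping both the native and the adapted form. The sketch below assumes this substitution approach; the map `DE_TO_EN`, the phoneme symbols, and the fixed `native_weight` are hypothetical.

```python
# Hypothetical map from German phonemes to the closest English phonemes.
DE_TO_EN = {"y:": "uw", "x": "k", "ʁ": "r"}


def accented_variants(native_pron, phoneme_map, native_weight=0.6):
    """Return scored pronunciations: the native form and a speaker-adapted form.

    native_pron is a tuple of phoneme symbols in the language of origin.
    Phonemes missing from phoneme_map are kept unchanged in the adapted form.
    """
    adapted = tuple(phoneme_map.get(p, p) for p in native_pron)
    if adapted == native_pron:
        # Nothing to substitute, so there is only one candidate.
        return [(native_pron, 1.0)]
    return [(native_pron, native_weight), (adapted, 1.0 - native_weight)]


variants = accented_variants(("f", "y:", "ʁ"), DE_TO_EN)
```

The fixed weight stands in for whatever scoring a real system would learn, for example from a user's pronunciation history as in claims 9 and 20.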

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Adams, Jeffrey Penrod; Parlikar, Alok Ulhas; Lilly, Jeffrey Paul; Rastrow, Ariya

Granted Patent

US 10,339,920 B2
US Class Current

1/1
CPC Class Codes

G06F 40/263   Language identification

G06N 20/10   using kernel methods, e.g. ...

G06N 7/01   Probabilistic graphical mod...

G10L 13/08   Text analysis or generation...

G10L 15/08   Speech classification or se...

G10L 15/187   Phonemic context, e.g. pron...

G10L 2015/025   Phonemes, fenemes or fenone...
