Acoustic model training

US 9,495,955 B1
Filed: 01/02/2013
Issued: 11/15/2016
Est. Priority Date: 01/02/2013
Status: Active Grant


Abstract

Features are disclosed for generating acoustic models from an existing corpus of data. Methods for generating the acoustic models can include receiving at least one characteristic of a desired acoustic model, selecting training utterances corresponding to the characteristic from a corpus comprising audio data and corresponding transcription data, and generating an acoustic model based on the selected training utterances.
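The selection step summarized in the abstract can be pictured as a filter over a tagged corpus. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the `Utterance` record, its `tags` field, and the `select_training_utterances` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    """One corpus entry: audio data plus its transcription (hypothetical layout)."""
    audio: list
    transcript: str
    tags: set = field(default_factory=set)  # characteristics assumed to be pre-assigned


def select_training_utterances(corpus, characteristic):
    """Return every utterance whose tags include the requested characteristic."""
    return [u for u in corpus if characteristic in u.tags]


corpus = [
    Utterance([0.1, 0.2], "what time is it", {"question", "female"}),
    Utterance([0.3, 0.1], "play some music", {"command", "male"}),
    Utterance([0.0, 0.4], "where is the station", {"question", "male"}),
]

questions = select_training_utterances(corpus, "question")
print([u.transcript for u in questions])
# ['what time is it', 'where is the station']
```

The selected utterances would then be handed to whatever acoustic model trainer is in use; that step is outside the scope of this sketch.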

40 Citations

26 Claims

1. An acoustic modeling system, comprising:
- under control of one or more computing devices configured with specific computer-executable instructions, receiving a plurality of characteristics of utterances to be used to create an acoustic model;
  
  for each characteristic in the plurality of characteristics:
  
  identifying an utterance within a corpus of utterances having the characteristic; and
  
  associating at least a portion of the utterance with a tag indicative of the characteristic;
  
  receiving an identification of a desired training utterance, wherein the desired training utterance comprises a first portion associated with a first desired characteristic and a second portion associated with a second desired characteristic, and wherein the desired training utterance is not included in the corpus;
  
  selecting, from the corpus, a first utterance, wherein a portion of the first utterance comprises at least the first portion of the desired training utterance, and wherein the portion of the first utterance is associated with a tag corresponding to the first desired characteristic;
  
  extracting the portion of the first utterance from the first utterance;
  
  selecting, from the corpus, a second utterance, wherein a portion of the second utterance comprises at least the second portion of the desired training utterance, and wherein the portion of the second utterance is associated with a tag corresponding to the second desired characteristic;
  
  extracting the portion of the second utterance from the second utterance;
  
  concatenating the portion of the first utterance with the portion of the second utterance to generate the desired training utterance; and
  
  training an acoustic model, wherein:
  
  the acoustic model comprises statistical representations of possible sounds of subword units; and
  
  the statistical representations are generated based on a comparison between audio data associated with the desired training utterance that is generated and a textual transcription of the desired training utterance that is generated.
- Dependent claims (2, 3, 4, 5)
  - 2. The system of claim 1, wherein the plurality of characteristics comprises at least one of:
    - phrase type characteristics, speaker characteristics, accent characteristics, pitch characteristics, intonation characteristics, emotion characteristics, subword unit characteristics, or background noise level characteristics.
  - 3. The system of claim 1, wherein the computing device is configured to select the first utterance from the corpus using natural language processing.
  - 4. The system of claim 1, wherein a desired characteristic is an indication of a word or subword unit.
  - 5. The system of claim 1, wherein the corpus comprises data from at least one of audiobooks, movie soundtracks, customer service calls, or speeches.
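The tag, select, extract, and concatenate steps recited in claim 1 can be illustrated with a small Python sketch. It is only a toy reading of the claim language under stated assumptions: the `Portion` record, the exact-text matching rule, and the function names `find_portion` and `build_training_utterance` are all hypothetical, and the corpus is assumed to be already tagged.

```python
from dataclasses import dataclass


@dataclass
class Portion:
    """A tagged span of a corpus utterance (hypothetical layout)."""
    text: str    # words covered by the span
    audio: list  # audio samples for the span
    tag: str     # characteristic the span was tagged with


def find_portion(corpus, text, tag):
    """Select the first utterance holding a portion with this text and tag,
    then extract (return) that portion."""
    for utterance in corpus:
        for portion in utterance:
            if portion.text == text and portion.tag == tag:
                return portion
    raise LookupError(f"no portion {text!r} tagged {tag!r} in corpus")


def build_training_utterance(corpus, wanted):
    """Concatenate one extracted portion per (text, tag) pair into a new utterance."""
    parts = [find_portion(corpus, text, tag) for text, tag in wanted]
    audio = [sample for p in parts for sample in p.audio]
    transcript = " ".join(p.text for p in parts)
    return audio, transcript


# Each corpus utterance is modeled as a list of tagged portions.
corpus = [
    [Portion("turn on", [0.1, 0.2], "command"), Portion("the radio", [0.3], "command")],
    [Portion("is it", [0.4], "question"), Portion("the kitchen light", [0.5, 0.6], "question")],
]

# "turn on the kitchen light" does not occur anywhere in the corpus.
audio, transcript = build_training_utterance(
    corpus, [("turn on", "command"), ("the kitchen light", "question")]
)
print(transcript)  # turn on the kitchen light
print(audio)       # [0.1, 0.2, 0.5, 0.6]
```

The resulting audio and transcript pair is what a trainer would compare when estimating statistical representations of subword units; the training step itself is not sketched here.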

6. A computer-implemented method, comprising:
- under control of one or more computing devices configured with specific computer-executable instructions, receiving an identification of a desired training utterance having a first portion and a second portion, wherein the desired training utterance is not included in a corpus;
  
  selecting, from the corpus, a first utterance comprising at least a first portion of the desired training utterance;
  
  extracting, from the first utterance, a portion of the first utterance comprising the first portion of the desired training utterance, wherein the first portion of the desired training utterance and at least the portion of the first utterance are associated with a first desired characteristic;
  
  selecting, from the corpus, a second utterance comprising at least a second portion of the desired training utterance;
  
  extracting, from the second utterance, a portion of the second utterance comprising the second portion of the desired training utterance, wherein the second portion of the desired training utterance and at least the portion of the second utterance are associated with a second desired characteristic;
  
  concatenating at least the portion of the first utterance and the portion of the second utterance to generate the desired training utterance; and
  
  training an acoustic model using the desired training utterance that is generated.
- Dependent claims (7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
  - 7. The computer-implemented method of claim 6, wherein the first desired characteristic is at least one of a phrase type characteristic, an accent characteristic, a pitch characteristic, an intonation characteristic, an emotion characteristic, a subword unit characteristic, or a background noise level characteristic.
  - 8. The computer-implemented method of claim 6, wherein the corpus comprises data from one or more of audiobooks, movie soundtracks, customer service calls, or speeches.
  - 9. The computer-implemented method of claim 6, wherein training the acoustic model using the desired training utterance that is generated comprises creating a new acoustic model or adapting an existing acoustic model.
  - 10. The computer-implemented method of claim 6, wherein the corpus was previously tagged with the first desired characteristic and the second desired characteristic.
  - 11. The computer-implemented method of claim 6, wherein selecting the first utterance comprising the first portion corresponding to the first desired characteristic from the corpus comprises identifying data from the corpus having the first desired characteristic by analyzing the corpus data.
  - 12. The computer-implemented method of claim 6, wherein the first desired characteristic comprises an utterance or portion of an utterance identified as having a low confidence value based on speech recognition results.
  - 13. The computer-implemented method of claim 6, wherein the first desired characteristic is one of a question utterance or a command utterance.
  - 14. The computer-implemented method of claim 6, wherein the first desired characteristic is one of a gender, an age category, or an accent.
  - 15. The computer-implemented method of claim 6, wherein training an acoustic model using the desired training utterance that is generated comprises at least one of generating, creating, configuring, updating, or adapting the acoustic model using the desired training utterance that is generated.
  - 16. The computer-implemented method of claim 6, wherein:
    - the desired training utterance that is generated comprises audio data and transcription data corresponding to the audio data of the desired training utterance that is generated; and
      
      concatenating at least the first portion of the first utterance and the portion of the second utterance to generate the desired training utterance comprises:
      
      creating the audio data of the desired training utterance that is generated from audio data of the first portion of the first utterance and audio data of the portion of the second utterance; and
      
      creating the transcription data of the desired training utterance that is generated from transcription data of the portion of the first utterance and transcription data of the portion of the second utterance.
  - 17. The computer-implemented method of claim 16, wherein training the acoustic model using the desired training utterance that is generated comprises generating statistical representations of sounds of subword units associated with the desired training utterance that is generated based on a comparison of the audio data of the desired training utterance that is generated with the transcription data of the desired training utterance that is generated.
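Claim 16 describes building both the audio data and the transcription data of the generated utterance from the corresponding data of the extracted portions. A hedged Python sketch of that bookkeeping follows. It assumes word-level start and end times are available for each corpus utterance (for example from forced alignment, which the claims do not specify), and `extract_portion`, `concatenate`, and `RATE` are hypothetical names.

```python
def extract_portion(audio, sample_rate, words, start_word, end_word):
    """Cut words[start_word:end_word] and the matching samples out of one utterance.

    `words` is a list of (word, start_seconds, end_seconds) tuples; these
    timings are an assumption of this sketch.
    """
    chosen = words[start_word:end_word]
    first = int(round(chosen[0][1] * sample_rate))
    last = int(round(chosen[-1][2] * sample_rate))
    return audio[first:last], [w for w, _, _ in chosen]


def concatenate(portions):
    """Join (samples, words) portions into one audio list and one transcription string."""
    audio, words = [], []
    for samples, portion_words in portions:
        audio.extend(samples)
        words.extend(portion_words)
    return audio, " ".join(words)


RATE = 10  # samples per second, unrealistically low to keep the example readable

first_audio = list(range(0, 20))      # stands in for 2 s of audio
first_words = [("call", 0.0, 0.5), ("my", 0.5, 1.0), ("mother", 1.0, 2.0)]
second_audio = list(range(100, 130))  # stands in for 3 s of audio
second_words = [("text", 0.0, 1.0), ("the", 1.0, 1.5), ("office", 1.5, 3.0)]

part_a = extract_portion(first_audio, RATE, first_words, 0, 1)    # "call"
part_b = extract_portion(second_audio, RATE, second_words, 1, 3)  # "the office"
audio, transcript = concatenate([part_a, part_b])
print(transcript)  # call the office
print(len(audio))  # 25
```

Keeping the samples and the words in step during concatenation means the generated audio and generated transcription stay aligned, which is what the comparison in claim 17 relies on.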

18. A system comprising:
- an electronic data store configured to store a corpus of audio data and corresponding transcription data; and
  
  at least one computing device in communication with the electronic data store and configured to:
  
  receive an identification of a desired training utterance having a first portion and a second portion, wherein the desired training utterance is not included in the corpus;
  
  select, from the corpus, a first utterance comprising at least a first portion of the desired training utterance;
  
  extract, from the first utterance, a portion of the first utterance comprising the first portion of the desired training utterance, wherein the first portion of the desired training utterance and at least the portion of the first utterance are associated with a first desired characteristic;
  
  select, from the corpus, a second utterance comprising a second portion of the desired training utterance;
  
  extract, from the second utterance, a portion of the second utterance comprising the second portion of the desired training utterance, wherein the second portion of the desired training utterance and at least the portion of the second utterance are associated with a second desired characteristic;
  
  concatenating at least the portion of the first utterance and the portion of the second utterance to generate the desired training utterance; and
  
  training an acoustic model using the desired training utterance that is generated.
- Dependent claims (19, 20, 21)
  - 19. The system of claim 18, wherein the first desired characteristic comprises a question or command utterance.
  - 20. The system of claim 18, wherein the first desired characteristic comprises a gender of the speaker or an age of the speaker.
  - 21. The system of claim 18, wherein the first desired characteristic is that the utterance comprises a particular subword unit.

22. A non-transitory computer-readable medium comprising one or more computer-executable modules, the one or more computer-executable modules configured to:
- receive an identification of a desired training utterance having a first portion and a second portion, wherein the desired training utterance is not included in a corpus;
  
  select, from the corpus, a first utterance comprising at least a first portion of the desired training utterance;
  
  extract, from the first utterance, a portion of the first utterance comprising the first portion of the desired training utterance, wherein the first portion of the desired training utterance and at least the portion of the first utterance are associated with a first desired characteristic;
  
  select, from the corpus, a second utterance comprising at least a second portion of the desired training utterance;
  
  extract, from the second utterance, a portion of the second utterance comprising the second portion of the desired training utterance, wherein the second portion of the desired training utterance and at least the portion of the second utterance are associated with a second desired characteristic;
  
  concatenating at least the portion of the first utterance and the portion of the second utterance to generate the desired training utterance; and
  
  train an acoustic model using the desired training utterance that is generated.
- Dependent claims (23, 24, 25, 26)
  - 23. The non-transitory computer-readable medium of claim 22, wherein the first desired characteristic is at least one of:
    - a phrase type, an accent characteristic, a pitch characteristic, an intonation characteristic, an emotion characteristic, a phonetic context characteristic, or a background noise level characteristic.
  - 24. The non-transitory computer-readable medium of claim 22, wherein the first desired characteristic comprises an identification of a query utterance.
  - 25. The non-transitory computer-readable medium of claim 22, wherein the first desired characteristic is a gender of the speaker or an age category of the speaker.
  - 26. The non-transitory computer-readable medium of claim 22, wherein the corpus comprises utterances from at least one of audiobooks, movie soundtracks, customer service calls or speeches.

Current Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee: Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors: Weber, Frederick Victor; Adams, Jeffrey Penrod
Primary Examiner(s): Shah, Paras D
Application Number: US13/733,084
Time in Patent Office: 1,413 Days
Field of Search: 704/231, 704/236, 704/239, 704/246, 704/251, 704/243, 704/244, 932/31, 932/36, 932/39, 932/46, 932/51, 932/43, 932/44
US Class Current: 1/1
CPC Class Codes: G10L 15/063 Training
