Acoustic model training using corrected terms

US 10,019,986 B2
Filed: 07/29/2016
Issued: 07/10/2018
Est. Priority Date: 07/29/2016
Status: Active Grant



Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for speech recognition. One of the methods includes receiving first audio data corresponding to an utterance; obtaining a first transcription of the first audio data; receiving data indicating (i) a selection of one or more terms of the first transcription and (ii) one or more of replacement terms; determining that one or more of the replacement terms are classified as a correction of one or more of the selected terms; in response to determining that the one or more of the replacement terms are classified as a correction of the one or more of the selected terms, obtaining a first portion of the first audio data that corresponds to one or more terms of the first transcription; and using the first portion of the first audio data that is associated with the one or more terms of the first transcription to train an acoustic model for recognizing the one or more of the replacement terms.
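The last two steps of the abstract (obtaining the portion of audio that corresponds to the selected terms, then using it as training data for the replacement terms) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `WordTiming` record, the assumption that the recognizer exposes per-term start and end times, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class WordTiming:
    """Hypothetical alignment record: one transcribed term and its time span."""
    term: str
    start_s: float
    end_s: float


def audio_portion(
    samples: Sequence[float],
    sample_rate: int,
    timings: List[WordTiming],
    first: int,
    last: int,
) -> List[float]:
    """Return the samples covering timings[first..last] (inclusive)."""
    if not 0 <= first <= last < len(timings):
        raise ValueError("selection is outside the transcription")
    start = round(timings[first].start_s * sample_rate)
    end = round(timings[last].end_s * sample_rate)
    return list(samples[start:end])


def training_example(
    samples: Sequence[float],
    sample_rate: int,
    timings: List[WordTiming],
    first: int,
    last: int,
    replacement_terms: List[str],
    is_correction: bool,
) -> Optional[Tuple[List[float], str]]:
    """Pair the audio portion with the replacement terms, or return None.

    None models the case where the replacement was not classified as a
    correction, so no training data is produced from this utterance.
    """
    if not is_correction:
        return None
    portion = audio_portion(samples, sample_rate, timings, first, last)
    return portion, " ".join(replacement_terms)
```

In this sketch the returned pair would be handed to whatever acoustic model trainer the system uses; the trainer itself is outside the scope of the example.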

18 Claims

1. A method comprising:
- receiving, from a client device and by a voice search system that includes (i) an automated speech recognizer that uses an acoustic model to transcribe utterances, (ii) a search engine, (iii) an acoustic model trainer that periodically retrains the acoustic model using portions of audio data that correspond to manually specified terms of first transcriptions, (iv) a user interface component, and (v) a correction classifier, first audio data corresponding to an utterance of a user;
  
  obtaining, by the automated speech recognizer of the voice search system, a first transcription of the first audio data;
  
  receiving, by the user interface component of the voice search system, data indicating (i) a selection of one or more terms of the first transcription and (ii) one or more of replacement terms that the user has manually specified as a replacement for the one or more terms;
  
  determining, by the correction classifier of the voice search system, a minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms;
  
  determining, by the correction classifier of the voice search system and based at least on the minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms that the user has manually specified as a replacement for the one or more terms, whether one or more of the replacement terms that the user has manually specified as a replacement for the one or more terms likely represent a correction of one or more of the one or more terms of the first transcription;
  
  in response to determining, based at least on the minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms that the user has manually specified as a replacement for the one or more terms, whether the one or more of the replacement terms that the user has manually specified as a replacement for the one or more terms likely represent a correction of the one or more terms of the first transcription, selectively retraining, by the acoustic model trainer of the voice search system, the acoustic model, comprising (i) retraining the acoustic model of the automated speech recognizer using a first portion of the audio that is associated with the one or more terms of the first transcription when the correction classifier indicates that the replacement terms likely represent a correction, or (ii) bypassing retraining of the acoustic model of the automated speech recognizer using the first portion of the first audio data that is associated with the one or more terms of the first transcription when the correction classifier indicates that the replacement terms do not likely represent a correction;
  
  obtaining, by the automated speech recognizer of the voice search system and using the retrained acoustic model, a transcription of audio data corresponding to a subsequently received utterance; and
  
  providing, by the user interface component of the voice search system, a user interface that includes one or more search results that the search engine of the voice search system has identified in response to the transcription of the audio data corresponding to the subsequently received utterance.
  - 2. The method of claim 1, comprising:
    - generating a second transcription by replacing the one or more terms of the first transcription with one or more of the replacement terms;
      
      obtaining, from the search engine of the voice search system, search results responsive to the second transcription; and
      
      providing, for output, one or more of the search results.
  - 3. The method of claim 1, wherein determining the minimum edit distance comprises determining a phonetic distance between each of the one or more terms of the first transcription and each of the one or more of the replacement terms.
  - 4. The method of claim 1, wherein determining the minimum edit distance comprises determining one or more connections among the one or more of the terms of the first transcription.
  - 5. The method of claim 1, wherein determining the minimum edit distance comprises determining that the one or more of the terms of the first transcription are consecutive terms.
  - 6. The method of claim 1, wherein determining the minimum edit distance comprises determining that each of the one or more of the terms of the first transcription includes a threshold number of characters.
  - 7. The method of claim 1, wherein determining whether one or more of the replacement terms that the user has manually specified as a replacement for the one or more terms likely represent a correction of one or more of the one or more terms of the first transcription comprises determining that the replacement terms were likely selected in error.
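The correction classifier of claims 1, 5, and 6 can be illustrated with a character-level Levenshtein distance plus two simple checks. This is a hedged sketch only: the claims do not fix a particular distance algorithm or thresholds, so the normalized-distance cutoff, the minimum character count, the function names, and the assumption that a small nonzero distance suggests a misrecognition (rather than a new query) are all illustrative choices. A phonetic distance as in claim 3 is not modeled here.

```python
from typing import List


def min_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def likely_correction(
    selected_terms: List[str],
    selected_positions: List[int],
    replacement_terms: List[str],
    max_normalized_distance: float = 0.5,  # assumed threshold
    min_chars: int = 2,                    # assumed threshold
) -> bool:
    """Heuristic stand-in for the correction classifier."""
    # Claim 5: the selected terms are consecutive in the transcription.
    consecutive = all(
        later - earlier == 1
        for earlier, later in zip(selected_positions, selected_positions[1:])
    )
    # Claim 6: each selected term has at least a threshold number of characters.
    long_enough = all(len(term) >= min_chars for term in selected_terms)

    original = " ".join(selected_terms).lower()
    replacement = " ".join(replacement_terms).lower()
    distance = min_edit_distance(original, replacement)
    longest = max(len(original), len(replacement), 1)

    # Identical text (distance 0) is not treated as a correction.
    close = 0 < distance and distance / longest <= max_normalized_distance
    return consecutive and long_enough and close
```

Under these assumptions, replacing "whether" with "weather" is treated as a likely correction, while replacing "pizza" with "sushi" is treated as a changed query, so retraining on that audio would be bypassed.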

8. A voice search system including (i) an automated speech recognizer that uses an acoustic model to transcribe utterances, (ii) a search engine, (iii) an acoustic model trainer that periodically retrains the acoustic model using portions of audio data that correspond to manually specified terms of first transcriptions, (iv) a user interface component, and (v) a correction classifier, the voice search system comprising:
- a processor configured to execute computer program instructions; and
  
  a computer storage medium encoded with the computer program instructions that, when executed by the processor, cause the system to perform operations comprising:
  
  receiving first audio data corresponding to an utterance of a user;
  
  receiving, from a client device, first audio data corresponding to an utterance of a user;
  
  obtaining, by the automated speech recognizer, a first transcription of the first audio data;
  
  receiving, by the user interface component, data indicating (i) a selection of one or more terms of the first transcription and (ii) one or more of replacement terms that the user has manually specified as a replacement for the one or more terms;
  
  determining, by the correction classifier, a minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms;
  
  determining, by the correction classifier of the voice search system and based at least on the minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms that the user has manually specified as a replacement for the one or more terms, whether one or more of the replacement terms that the user has manually specified as a replacement for the one or more terms likely represent a correction of one or more of the one or more terms of the first transcription;
  
  in response to determining, based at least on the minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms that the user has manually specified as a replacement for the one or more terms, whether the one or more of the replacement terms that the user has manually specified as a replacement for the one or more terms likely represents a correction of the one or more terms of the first transcription, selectively retraining, by the acoustic model trainer of the voice search system, the acoustic model, comprising (i) retraining the acoustic model of the automated speech recognizer using a first portion of the audio that is associated with the one or more terms of the first transcription when the correction classifier indicates that the replacement terms likely represent a correction, or (ii) bypassing retraining of the acoustic model of the automated speech recognizer using the first portion of the first audio data that is associated with the one or more terms of the first transcription when the correction classifier indicates that the replacement terms do not likely represent a correction;
  
  obtaining, by the automated speech recognizer and using the retrained acoustic model, a transcription of audio data corresponding to a subsequently received utterance; and
  
  providing, by the user interface component, a user interface that includes one or more search results that the search engine of the voice search system has identified in response to the transcription of the audio data corresponding to the subsequently received utterance.
  - 9. The system of claim 8, wherein the operations comprise:
    - generating a second transcription by replacing the one or more terms of the first transcription with one or more of the replacement terms;
      
      obtaining, from the search engine of the voice search system, search results responsive to the second transcription; and
      
      providing, for output, one or more of the search results.
  - 10. The system of claim 8, wherein determining the minimum edit distance comprises determining a phonetic distance between each of the one or more terms of the first transcription and each of the one or more of the replacement terms.
  - 11. The system of claim 8, wherein determining the minimum edit distance comprises determining one or more connections among the one or more of the terms of the first transcription.
  - 12. The system of claim 8, wherein determining the minimum edit distance comprises determining that the one or more of the terms of the first transcription are consecutive terms.
  - 13. The system of claim 8, wherein determining the minimum edit distance comprises determining that each of the one or more of the terms of the first transcription includes a threshold number of characters.
  - 14. The system of claim 8, wherein determining whether one or more of the replacement terms that the user has manually specified as a replacement for the one or more terms likely represent a correction of one or more of the one or more terms of the first transcription comprises determining that the replacement terms were likely selected in error.

15. A computer-readable storage device encoded with a computer program, the computer program comprising instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
- receiving, from a client device and by a voice search system that includes (i) an automated speech recognizer that uses an acoustic model to transcribe utterances, (ii) a search engine, (iii) an acoustic model trainer that periodically retrains the acoustic model using portions of audio data that correspond to manually specified terms of first transcriptions, (iv) a user interface component, and (v) a correction classifier, first audio data corresponding to an utterance of a user;
  
  obtaining, by the automated speech recognizer of the voice search system, a first transcription of the first audio data;
  
  receiving, by the user interface component of the voice search system, data indicating (i) a selection of one or more terms of the first transcription and (ii) one or more of replacement terms that the user has manually specified as a replacement for the one or more terms;
  
  determining, by the correction classifier of the voice search system, a minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms;
  
  determining, by the correction classifier of the voice search system and based at least on the minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms that the user has manually specified as a replacement for the one or more terms, whether one or more of the replacement terms that the user has manually specified as a replacement for the one or more terms likely represent a correction of one or more of the one or more terms of the first transcription;
  
  in response to determining, based at least on the minimum edit distance between the one or more terms of the first transcription and the one or more replacement terms that the user has manually specified as a replacement for the one or more terms, whether the one or more of the replacement terms that the user has manually specified as a replacement for the one or more terms likely represents a correction of the one or more terms of the first transcription, selectively retraining, by the acoustic model trainer of the voice search system, the acoustic model, comprising (i) retraining the acoustic model of the automated speech recognizer using a first portion of the audio that is associated with the one or more terms of the first transcription when the correction classifier indicates that the replacement terms likely represent a correction, or (ii) bypassing retraining of the acoustic model of the automated speech recognizer using the first portion of the first audio data that is associated with the one or more terms of the first transcription when the correction classifier indicates that the replacement terms do not likely represent a correction;
  
  obtaining, by the automated speech recognizer of the voice search system and using the retrained acoustic model, a transcription of audio data corresponding to a subsequently received utterance; and
  
  providing, by the user interface component of the voice search system, a user interface that includes one or more search results that the search engine of the voice search system has identified in response to the transcription of the audio data corresponding to the subsequently received utterance.
  - 16. The device of claim 15, wherein the operations comprise:
    - generating a second transcription by replacing the one or more terms of the first transcription with one or more of the replacement terms;
      
      obtaining, from the search engine of the voice search system, search results responsive to the second transcription; and
      
      providing, for output, one or more of the search results.
  - 17. The device of claim 15, wherein determining the minimum edit distance comprises determining a phonetic distance between each of the one or more terms of the first transcription and each of the one or more of the replacement terms.
  - 18. The device of claim 15, wherein determining the minimum edit distance comprises determining one or more connections among the one or more of the terms of the first transcription.
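The second-transcription step recited in claims 2, 9, and 16 (replace the selected terms, then obtain search results for the new text) might look like the sketch below. The whitespace tokenization, the index-based selection, and the `search` callable standing in for the search engine are assumptions made for illustration; the claims do not specify any of them.

```python
from typing import Callable, List


def second_transcription(
    first_transcription: str,
    start: int,
    end: int,
    replacement_terms: List[str],
) -> str:
    """Replace terms start..end (inclusive) of the first transcription."""
    terms = first_transcription.split()
    if not 0 <= start <= end < len(terms):
        raise ValueError("selection is outside the transcription")
    return " ".join(terms[:start] + list(replacement_terms) + terms[end + 1:])


def results_for_correction(
    first_transcription: str,
    start: int,
    end: int,
    replacement_terms: List[str],
    search: Callable[[str], List[str]],  # hypothetical search engine hook
    max_results: int = 3,
) -> List[str]:
    """Query the search hook with the second transcription."""
    query = second_transcription(first_transcription, start, end, replacement_terms)
    return search(query)[:max_results]
```

For example, `second_transcription("whether in paris", 0, 0, ["weather"])` yields `"weather in paris"`, which is then passed to the search hook.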


Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Kapralova, Olga; Cherepanov, Evgeny A.; Osmakov, Dmitry; Baeuml, Martin; Skobeltsyn, Gleb
Primary Examiner(s): Yang, Qian
Application Number: US15/224,104
Publication Number: US 20180033426A1
Time in Patent Office: 711 Days
CPC Class Codes

G10L 15/01   Assessment or evaluation of...

G10L 15/06   Creation of reference templ...

G10L 15/063   Training

G10L 15/10   using distance or distortio...

G10L 15/22   Procedures used during a sp...

G10L 15/32   Multiple recognisers used i...

G10L 2015/0635   updating or merging of old ...

G10L 2015/0638   Interactive procedures

G10L 2015/221   Announcement of recognition...
