Speech recognition process

US 8,775,177 B1
Filed: 10/31/2012
Issued: 07/08/2014
Est. Priority Date: 03/08/2012
Status: Active Grant

Abstract

A speech recognition process may perform the following operations: performing a preliminary recognition process on first audio to identify candidates for the first audio; generating first templates corresponding to the first audio, where each first template includes a number of elements; selecting second templates corresponding to the candidates, where the second templates represent second audio, and where each second template includes elements that correspond to the elements in the first templates; comparing the first templates to the second templates, where comparing includes determining similarity metrics between the first templates and corresponding second templates; applying weights to the similarity metrics to produce weighted similarity metrics, where the weights are associated with corresponding second templates; and using the weighted similarity metrics to determine whether the first audio corresponds to the second audio.

16 Claims

1. A method performed by one or more processing devices, comprising:
- performing a preliminary recognition process on first audio, the preliminary recognition process comprising:
  
  identifying one or more candidates for the first audio;
  
  determining a plurality of path costs for the identified candidates, the plurality of path costs corresponding to sequences of sub-phonemes identified in the first audio;
  
  determining a best path cost for each of the identified candidates based on the plurality of path costs;
  
  associating the best path costs with the identified candidates; and
  
  providing the identified candidates and associated best path costs;
  
  generating first templates corresponding to the first audio, each first template comprising a number of elements corresponding to a sequence of sub-phonemes of the first audio;
  
  selecting second templates corresponding to the identified candidates, the second templates representing second audio, each second template comprising elements that correspond to the elements in the first templates;
  
  comparing the first templates to the second templates, wherein comparing comprises determining similarity metrics between the first templates and corresponding second templates, wherein the similarity metrics are based on exponentiated and scaled dynamic time warping (DTW) distances between the selected ones of the first templates and selected ones of the second templates;
  
  applying weights to the similarity metrics to produce weighted similarity metrics, the weights being associated with corresponding second templates;
  
  applying the weighted similarity metrics to corresponding best path costs to produce re-scored path costs, the re-scored path costs being associated with corresponding identified candidates; and
  
  using the re-scored path costs to determine which of the identified candidates corresponds to the first audio.
  - 2. The method of claim 1, wherein selecting the second templates comprises selecting templates associated with a non-zero weight.
  - 3. The method of claim 1, wherein metadata is associated with at least one of the first audio and the second audio, the metadata being used in obtaining at least the second templates.
  - 4. The method of claim 3, wherein the metadata is indicative of the context of at least one of the first audio and the second audio.
  - 5. The method of claim 4, wherein the metadata indicates at least one word that neighbors a word in at least one of the first audio and the second audio.
  - 6. The method of claim 1, wherein the preliminary recognition process comprises a Hidden Markov Model (HMM) based process.
  - 7. The method of claim 1, wherein applying the weighted similarity metrics to corresponding best path costs to produce re-scored path costs comprises using a conditional random field technique to generate a composite score indicative of an extent to which the first audio corresponds to the second audio.
  - 8. The method of claim 1, wherein each element is at least one of:
    - a phoneme in context, a syllable, or a word.
  - 9. The method of claim 1, wherein the first templates comprise vectors, the second templates comprise vectors, and the similarity metrics comprise distances between vectors.
  - 10. The method of claim 1, wherein the second templates comprise multiple groups of second templates, each group of second templates representing a different version of a same candidate word or phrase for at least one of the first and second audio.
  - 11. The method of claim 1, wherein second templates are selected from among a group of templates having associated weights, at least some of the weights being negative.
  - 12. The method of claim 1, wherein the weights are determined using a conditional random field technique.
  - 13. The method of claim 11, wherein at least some of the weights are zero, the zero weights being determined using a regularization technique.
  - 14. The method of claim 1, wherein metadata is associated with at least one of the first audio and the second audio, the metadata indicating at least one of:
    - information about a speaker of at least one of the first audio or the second audio, and information about an acoustic condition of at least one of the first audio or the second audio.
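
The comparing step of claim 1, together with the vector templates of claim 9, can be illustrated with a short sketch. The claims say only that the similarity metrics are "based on exponentiated and scaled" DTW distances and give no formula, so the form `exp(-distance / scale)`, the Euclidean frame distance, and the names `dtw_distance`, `similarity`, and `scale` are all assumptions made here for illustration, not the patented implementation.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two templates.

    Each template is a non-empty sequence of equal-length feature vectors.
    The per-frame distance is Euclidean (an assumption for this sketch).
    """
    def frame_dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b frame repeats
                                 cost[i][j - 1],      # b advances, a frame repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def similarity(a, b, scale=1.0):
    """Hypothetical similarity: exponentiated, scaled DTW distance.

    Yields 1.0 for identical templates and decays toward 0.0 as the
    DTW distance grows. `scale` must be positive.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.exp(-dtw_distance(a, b) / scale)
```

Because DTW allows frames to repeat, a template and a time-stretched copy of it have distance zero, so their similarity under this sketch is 1.0.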

15. One or more non-transitory machine-readable media storing instructions that are executable to perform operations comprising:
- performing a preliminary recognition process on first audio, the preliminary recognition process comprising:
  
  identifying one or more candidates for the first audio;
  
  determining a plurality of path costs for the identified candidates, the plurality of path costs corresponding to sequences of sub-phonemes identified in the first audio;
  
  determining a best path cost for each of the identified candidates based on the plurality of path costs;
  
  associating the best path costs with the identified candidates; and
  
  providing the identified candidates and associated best path costs;
  
  generating first templates corresponding to the first audio, each first template comprising a number of elements corresponding to a sequence of sub-phonemes of the first audio;
  
  selecting second templates corresponding to the identified candidates, the second templates representing second audio, each second template comprising elements that correspond to the elements in the first templates;
  
  comparing the first templates to the second templates, wherein comparing comprises determining similarity metrics between the first templates and corresponding second templates, wherein the similarity metrics are based on exponentiated and scaled dynamic time warping (DTW) distances between the selected ones of the first templates and selected ones of the second templates;
  
  applying weights to the similarity metrics to produce weighted similarity metrics, the weights being associated with corresponding second templates;
  
  applying the weighted similarity metrics to corresponding best path costs to produce re-scored path costs, the re-scored path costs being associated with corresponding identified candidates; and
  
  using the re-scored path costs to determine which of the identified candidates corresponds to the first audio.
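
The re-scoring steps recited in claims 1 and 15 can be sketched as follows. The claims do not state how the best path cost is chosen or how the weighted similarity metrics are "applied" to it, so this sketch assumes that lower cost is better, that the best path cost is the minimum, and that the weighted similarities are summed and subtracted from it. The function names and the dictionary layout are hypothetical.

```python
def best_path_costs(path_costs):
    """Pick one best cost per candidate from its list of path costs.

    Assumes lower cost is better, so the best cost is the minimum.
    """
    return {cand: min(costs) for cand, costs in path_costs.items()}


def rescore(best_costs, similarities, weights):
    """Apply weighted similarity metrics to each candidate's best path cost.

    similarities: candidate -> {template_id: similarity metric}
    weights:      template_id -> weight (may be negative or zero)
    Subtracting the weighted sum is an assumption of this sketch.
    """
    rescored = {}
    for cand, cost in best_costs.items():
        weighted = sum(weights.get(tid, 0.0) * sim
                       for tid, sim in similarities.get(cand, {}).items())
        rescored[cand] = cost - weighted
    return rescored


def pick_candidate(rescored):
    """Return the candidate with the lowest re-scored path cost."""
    return min(rescored, key=rescored.get)


# Illustrative values only; none of these numbers come from the patent.
first_pass = {"cat": [12.0, 10.0], "cap": [9.5, 11.0]}
sims = {"cat": {"t1": 0.9, "t2": 0.5}, "cap": {"t3": 0.2}}
wts = {"t1": 1.0, "t2": 0.5, "t3": 1.0}

best = best_path_costs(first_pass)
rescored = rescore(best, sims, wts)
winner = pick_candidate(rescored)
```

In this example the first pass slightly prefers "cap" (9.5 against 10.0), but the template evidence for "cat" is stronger, so the re-scored costs reverse the ranking and `winner` is "cat".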

16. A system comprising:
- memory to store an acoustic model; and
  
  one or more processing devices to perform operations associated with the acoustic model, the acoustic model comprising:
  
  a first pass module to perform a preliminary recognition process on first audio, the preliminary recognition process comprising:
  
  identifying one or more candidates for the first audio;
  
  determining a plurality of path costs for the identified candidates, the plurality of path costs corresponding to sequences of sub-phonemes identified in the first audio;
  
  determining a best path cost for each of the identified candidates based on the plurality of path costs;
  
  associating the best path costs with the identified candidates; and
  
  providing the identified candidates and associated best path costs;
  
  a second pass module to:
  
  generate first templates corresponding to the first audio, each first template comprising a number of elements corresponding to a sequence of sub-phonemes of the first audio;
  
  select second templates corresponding to the identified candidates, the second templates representing second audio, each second template comprising elements that correspond to the elements in the first templates;
  
  compare the first templates to the second templates, wherein comparing comprises determining similarity metrics between the first templates and corresponding second templates, wherein the similarity metrics are based on exponentiated and scaled dynamic time warping (DTW) distances between the selected ones of the first templates and selected ones of the second templates;
  
  apply weights to the similarity metrics to produce weighted similarity metrics, the weights being associated with corresponding second templates;
  
  apply the weighted similarity metrics to corresponding best path costs to produce re-scored path costs, the re-scored path costs being associated with corresponding identified candidates; and
  
  use the re-scored path costs to determine which of the identified candidates corresponds to the first audio.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Weintraub, Mitchel; Heigold, Georg; Nguyen, Patrick An Phu; Vanhoucke, Vincent O.
Primary Examiner(s): Chawan, Vijay B
Application Number: US13/665,245
Time in Patent Office: 615 Days
Field of Search: 704/251, 704/253, 704/256, 704/254, 704/243, 704/244, 704/500, 704/239, 704/270.1, 379/88.03, 379/88.22, 379/88.13, 379/88.26, 379/37, 707/999.102, 709/203, 705/53, 705/57, 705/80, 705/6, 715/716
US Class Current: 704/243
CPC Class Codes:
- G10L 15/10 using distance or distortio...
- G10L 2015/085 Methods for reducing search...
