Efficient Empirical Determination, Computation, and Use of Acoustic Confusability Measures

US 20080126089A1
Filed: 10/31/2007
Published: 05/29/2008
Est. Priority Date: 10/31/2002
Status: Active Grant

First Claim

1. A computer implemented method for efficient empirical determination, computation, and use of an acoustic confusability measure, comprising the steps of:

empirically deriving an acoustic confusability measure by determining acoustic confusability between at least any two textual phrases in a given language;

wherein said measure of acoustic confusability is empirically derived from examples of application of utterances to a specific speech recognition application;

iterating from an initial estimate of said acoustic confusability measure to improve said measure; and

using said acoustic confusability measure to make principled choices about which specific phrases to make recognizable by said speech recognition application.

Abstract

Efficient empirical determination, computation, and use of an acoustic confusability measure comprises: (1) an empirically derived acoustic confusability measure, comprising a means for determining the acoustic confusability between any two textual phrases in a given language, where the measure of acoustic confusability is empirically derived from examples of the application of a specific speech recognition technology, where the procedure does not require access to the internal computational models of the speech recognition technology, and does not depend upon any particular internal structure or modeling technique, and where the procedure is based upon iterative improvement from an initial estimate; (2) techniques for efficient computation of the empirically derived acoustic confusability measure, comprising means for efficient application of an acoustic confusability score, allowing practical application to very large-scale problems; and (3) a method for using acoustic confusability measures to make principled choices about which specific phrases to make recognizable by a speech recognition application.

18 Claims

1. A computer implemented method for efficient empirical determination, computation, and use of an acoustic confusability measure, comprising the steps of:
- empirically deriving an acoustic confusability measure by determining acoustic confusability between at least any two textual phrases in a given language;
  
  wherein said measure of acoustic confusability is empirically derived from examples of application of utterances to a specific speech recognition application;
  
  iterating from an initial estimate of said acoustic confusability measure to improve said measure; and
  
  using said acoustic confusability measure to make principled choices about which specific phrases to make recognizable by said speech recognition application.

2. A computer implemented method for determining an empirically derived acoustic confusability measure, comprising the steps of:
- performing corpus processing by passing an original corpus through an automatic speech recognition system of interest once, one utterance at a time; and
  
  developing a family of phoneme confusability models by repeatedly passing over said recognized corpus, analyzing each pair of phoneme sequences to collect information regarding the confusability of any two phonemes, at each step delivering an improved family of confusability models.
- - 3. The method of claim 2, said corpus processing step comprising the steps of:
    - for each input utterance, said recognition system generating both a decoding, in a decoded frame sequence, wherein a frame comprises a brief audio segment of the input utterance, and a confidence score which comprises a measure, determined by said recognition system, of the likelihood that a given decoding is correct;
      
      transforming said decoded frame sequence into a shorter decoded phoneme sequence;
      
      inspecting a true transcription of said input utterance;
      
      transforming said true transcription into a true phoneme sequence;
      
      for each utterance outputting, collectively, a recognized corpus containing a large number of pairs of phoneme sequences, and comprising a confidence score and a pair of phoneme sequences which comprise the decoded phoneme sequence and the true phoneme sequence.
  - 4. The method of claim 2, said developing step comprising the steps of:
    - iterating until there is no further change in a family of probability models, or the change becomes negligible;
      
      outputting a family of probability models, which estimates acoustic confusability of any two members of an augmented phoneme alphabet; and
      
      deriving an acoustic confusability measure.
  - 5. The method of claim 2, said corpus comprising:
    - a representative set of utterances, in a given human language, with associated reliable transcriptions;
      
      wherein an utterance comprises a sound recording, represented in a suitable computer-readable form;
      
      wherein a transcription comprises a conventional textual representation of said utterance; and
      
      wherein a reliable transcription is a transcription that may be regarded as accurate.
  - 6. The method of claim 2,
    - wherein said decoded frame sequence comprises said recognizer's best guess, for each frame of an utterance, of a phoneme being enunciated in that audio frame; and
      
      wherein a phoneme comprises one of a finite number of basic sound units of a human language.
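
The corpus processing steps of claims 2 and 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: `recognize` and `pronounce` are hypothetical callables standing in for the speech recognition system and the pronunciation lookup, `RecognizedEntry` is an illustrative record type, and the frame sequence is shortened by collapsing runs of identical labels, which is the coalescing rule recited later in claim 7.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Callable, Iterable, List, Sequence, Tuple


@dataclass
class RecognizedEntry:
    """One element of the recognized corpus (illustrative record type)."""
    confidence: float   # recognizer's confidence in its decoding
    decoded: List[str]  # decoded phoneme sequence
    true: List[str]     # true phoneme sequence


def coalesce(frames: Sequence[str]) -> List[str]:
    """Collapse each run of identical contiguous frame labels to one phoneme."""
    return [label for label, _run in groupby(frames)]


def process_corpus(
    corpus: Iterable[Tuple[object, str]],
    recognize: Callable[[object], Tuple[Sequence[str], float]],  # hypothetical
    pronounce: Callable[[str], Sequence[str]],                   # hypothetical
) -> List[RecognizedEntry]:
    """Pass each (utterance, transcription) pair through the recognizer once."""
    recognized = []
    for utterance, transcription in corpus:
        frames, confidence = recognize(utterance)
        recognized.append(
            RecognizedEntry(confidence, coalesce(frames), list(pronounce(transcription)))
        )
    return recognized
```

Passing the recognizer in as a callable mirrors the abstract's point that the procedure needs only the recognizer's outputs, not access to its internal models.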

7. In a computer implemented method for determining an empirically derived acoustic confusability measure, a corpus processing method comprising the steps of:
- receiving an input utterance;
  
  performing corpus processing by passing an original corpus through an automatic speech recognition system of interest once, one utterance at a time;
  
  wherein said corpus comprises pairs of utterances and transcriptions, and for each pair in said corpus;
  
  applying a recognizer to the input utterance, yielding as an output a decoded frame sequence and a confidence score;
  
  coalescing identical sequential phonemes in said decoded frame sequence to obtain a decoded phoneme sequence by replacing each subsequence of identical contiguous phonemes that appear in said sequence by a single phoneme of a same type;
  
  generating a pronunciation of a transcription of an utterance by lookup in a dictionary of the automatic recognition system, or by use of an automatic pronunciation generation system; and
  
  applying said steps sequentially to each element of said corpus to obtain a recognized corpus.
- - 8. The method of claim 7, further comprising the step of:
    - mapping phonemes by applying a phoneme simplification map to each element of said decoded frame sequence, yielding a new decoded frame sequence.
  - 9. The method of claim 7, wherein if there is more than one valid pronunciation for a transcription, performing any of the steps of:
    - picking a most popular pronunciation, if this is known;
      
      picking a pronunciation at random;
      
      using all of the pronunciations, by enlarging said corpus to contain as many repetitions of an utterance as there are pronunciations of the transcription, and pairing each distinct pronunciation with a separate instance of the utterance; and
      
      picking the pronunciation that is closest, in the sense of string edit distance, to the decoded phoneme sequence.
  - 13. The method of claim 7, further comprising the step of:
    - performing N-best variant corpus processing wherein, for each utterance, each entry in the N-best list is treated as a separate decoding.
  - 14. The method of claim 7, further comprising the step of:
    - performing N-best variant iterative development of said probability model family by:
      
      performing N-best variant corpus processing wherein, for each utterance, each entry in the N-best list is treated as a separate decoding; and
      
      when processing a given entry, incrementing each count by a confidence score of the given entry, rather than by 1.
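
The last option of claim 9, picking the pronunciation closest to the decoded phoneme sequence by string edit distance, might look like the following sketch. Unit costs for insertion, deletion, and substitution are an assumption here, since the claim does not fix them, and the function names are illustrative.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences, with unit costs (assumed)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


def closest_pronunciation(
    pronunciations: Sequence[Sequence[str]], decoded: Sequence[str]
) -> List[str]:
    """Return the candidate pronunciation nearest to the decoded sequence."""
    return list(min(pronunciations, key=lambda p: edit_distance(p, decoded)))
```

When several pronunciations are equally close, `min` returns the first one listed, so the order of the candidate list acts as the tie-break.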

10. In a computer implemented method for determining an empirically derived acoustic confusability measure, an iterative method for development of a probability model family, comprising the steps of:
- providing a recognized corpus;
  
  establishing a termination condition which depends on one or more of:
  
  a number of iterations executed;
  
  closeness of match between previous and current probability model families; or
  
  another consideration;
  
  defining a family of decoding costs;
  
  setting an iteration count to 0;
  
  setting a phoneme pair count to 0;
  
  for each entry in the recognized corpus, performing the following steps:
  
  constructing a lattice;
  
  populating said lattice arcs with values drawn from a current family of decoding costs;
  
  applying a Bellman-Ford dynamic programming algorithm, or Dijkstra's shortest path first algorithm, to find a shortest path through said lattice, from a source node to a terminal node; and
  
  traversing said determined shortest path, wherein for each arc that is traversed, the phoneme pair count is incremented by 1;
  
  for each transcription, computing a confidence score which is the sum of a phoneme pair value over all transcriptions paired with an utterance;
  
  estimating a family of probability models;
  
  if the iteration count > 0, testing a termination condition;
  
  if said termination condition is satisfied, returning a desired probability model family and stopping;
  
  if said termination condition is not satisfied, defining a new family of decoding costs;
  
  incrementing said iteration count and repeating.
- - 11. The method of claim 10, said step of estimating a family of probability models comprising either of the steps of:
    - if the confidence value is non-zero for every transcription then setting the probability to a ratio of confidence for a phoneme pair over confidence for said utterance;
      
      if the confidence value is zero for any transcription, then applying a desired zero-count probability estimator to estimate probability.
  - 12. The method of claim 10, the step of constructing a lattice comprising the steps of:
    - for an entry in the recognized corpus with a decoded phoneme sequence containing N phonemes, and a true phoneme sequence containing Q phonemes, constructing a rectangular lattice of dimension (N+1) rows by (Q+1) columns, and with an arc from a node (i, j) to each of nodes (i+1, j), (i, j+1), and (i+1, j+1), when present in said lattice, where “node (i, j)” refers to the node in row i, column j of the lattice;
      
      labeling:
      
      each arc (i, j)→(i, j+1) with the cost δ_(i)(ε|t_j);
      
      each arc (i, j)→(i+1, j) with the cost δ_(i)(d_i|ε); and
      
      each arc (i, j)→(i+1, j+1) with the cost δ_(i)(d_i|t_j);
      
      applying the Bellman-Ford dynamic programming algorithm or Dijkstra's shortest path first algorithm to find a shortest path from the source node, which we define as node (0, 0), to the terminal node, which we define as node (N, Q);
      
      determining a minimum cost path from the source node to each node of the lattice by repeated application of the foregoing steps;
      
      outputting a sequence of arcs A=a₁, a₂, . . . , a_K in said lattice that are known to comprise the minimum cost path from the source node to the terminal node; and
      
      for each arc a_i in the minimum cost path A, labeled with a phoneme pair, incrementing the phoneme pair counter by 1.
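
The iterative development of claims 10 to 12 can be sketched as below, under several assumptions that the claims do not state: the empty phoneme is represented by the empty string `EPS`; the initial decoding costs are 0 for a match and 1 otherwise; new costs are the negative log of the estimated probabilities; each probability is a pair count divided by the total count for the same true phoneme; unseen pairs receive a small floor probability as the zero-count estimator; and iteration stops when the pair counts no longer change. The claims name Bellman-Ford or Dijkstra for the shortest path. Because the lattice is acyclic, this sketch substitutes a single dynamic programming pass in row-major order, which finds the same minimum cost. Python lists are 0-indexed, so the arc into node (i, j) uses `decoded[i - 1]` and `true[j - 1]`.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

EPS = ""  # stand-in for the empty phoneme (an assumption of this sketch)
Pair = Tuple[str, str]  # (decoded phoneme or EPS, true phoneme or EPS)
Cost = Callable[[str, str], float]


def initial_cost(d: str, t: str) -> float:
    """Assumed starting costs: 0 for an exact match, 1 for anything else."""
    return 0.0 if d == t else 1.0


def align(decoded: Sequence[str], true: Sequence[str], cost: Cost) -> List[Pair]:
    """Phoneme pairs along a minimum cost path from node (0, 0) to (N, Q)."""
    n, q = len(decoded), len(true)
    best = [[math.inf] * (q + 1) for _ in range(n + 1)]
    back = [[None] * (q + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(q + 1):
            steps = []  # candidate arcs entering node (i, j)
            if j > 0:
                steps.append((i, j - 1, (EPS, true[j - 1])))
            if i > 0:
                steps.append((i - 1, j, (decoded[i - 1], EPS)))
            if i > 0 and j > 0:
                steps.append((i - 1, j - 1, (decoded[i - 1], true[j - 1])))
            for pi, pj, pair in steps:
                total = best[pi][pj] + cost(*pair)
                if total < best[i][j]:
                    best[i][j], back[i][j] = total, (pi, pj, pair)
    path, i, j = [], n, q
    while (i, j) != (0, 0):  # walk back-pointers from the terminal node
        i, j, pair = back[i][j]
        path.append(pair)
    return path[::-1]


def estimate(counts: Counter) -> Dict[Pair, float]:
    """Assumed estimator: pair count over the total count for the true phoneme."""
    totals: Counter = Counter()
    for (_d, t), c in counts.items():
        totals[t] += c
    return {(d, t): c / totals[t] for (d, t), c in counts.items()}


def train(corpus, max_iters: int = 10, floor: float = 1e-6) -> Dict[Pair, float]:
    """Re-align and re-estimate until the pair counts stop changing."""
    cost: Cost = initial_cost
    probs: Dict[Pair, float] = {}
    previous = None
    for _ in range(max_iters):
        counts: Counter = Counter()
        for decoded, true in corpus:
            counts.update(align(decoded, true, cost))  # +1 per traversed arc
        probs = estimate(counts)
        if counts == previous:  # termination test, skipped on the first pass
            break
        previous = counts
        # Assumed cost update: negative log probability, floored for unseen pairs.
        cost = lambda d, t, p=probs: -math.log(p.get((d, t), floor))
    return probs
```

For the N-best variant of claim 14, the same loop would add each entry's confidence score to the counts instead of 1.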

15. A method for computing an empirically derived acoustic confusability of two phrases, comprising the steps of:
- determining a desired probability model family;
  
  using said probability model family to compute acoustic confusability of two arbitrary phrases w and v by:
  
  computing a raw phrase acoustic confusability measure, which is a measure of the acoustic similarity of phrases v and w; and
  
  computing a grammar-relative confusion probability measure, which is an estimate of the probability that a grammar-constrained recognizer returns the phrase v as a decoding, when a true phrase is w.
- - 16. The method of claim 15, said step of computing a phrase acoustic confusability measure further comprising the steps of:
    - given pronunciations q(w) and q(v), computing the raw pronunciation acoustic confusability by:
      
      defining decoding costs for each phoneme;
      
      constructing a lattice L=q(v)×q(w), and labeling it with said phoneme decoding costs, depending upon the phonemes of q(v) and q(w);
      
      finding a minimum cost path A=a₁, a₂, . . . , a_K, from a source node to a terminal node of L;
      
      computing a cost of the minimum cost path A, as a sum of the decoding costs for each arc a∈A; and
      
      computing a raw pronunciation acoustic confusability measure of q(v) and q(w).
  - 17. The method of claim 15, further comprising the steps of:
    - computing a phrase acoustic confusability measure with no reference to pronunciations by any one of the following:
      
      worst case;
      
      most common;
      
      average case;
      
      random; and
      
      a combination of the worst case, most common, average case, and random methods into additional hybrid variants.
  - 18. The method of claim 15, said step of computing a grammar-relative pronunciation confusion probability comprising the steps of:
    - letting L(G) be a set of all phrases admissible by a grammar G, and letting Q(L(G)) be a set of all pronunciations of all such phrases;
      
      letting two pronunciations q(v), q(w)∈Q(L(G)) be given;
      
      estimating a probability that an utterance corresponding to a pronunciation q(w) is decoded by a recognizer R_G as q(v), as follows:
      
      computing a normalizer of q(w) relative to G, written Z(q(w), G), as Z(q(w), G)=Σ r(q(x)|q(w)), where the sum extends over all q(x)∈Q(L(G)); and
      
      setting a probability p(q(v)|q(w), G)=r(q(v)|q(w))/Z(q(w), G).
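
The normalization in claim 18 reduces to a few lines once a raw confusability function r is available. In this sketch the raw measure is passed in as a callable, and `toy_r` is an invented lookup table used only to make the example runnable. How r is derived from the minimum path cost is not specified in the claims and is not assumed here.

```python
from typing import Callable, Dict, Sequence, Tuple

Pron = Tuple[str, ...]               # a pronunciation as a tuple of phonemes
Raw = Callable[[Pron, Pron], float]  # r(q_v, q_w): raw confusability of q_v given q_w


def normalizer(q_w: Pron, grammar_prons: Sequence[Pron], r: Raw) -> float:
    """Z(q(w), G): sum of r(q(x)|q(w)) over every pronunciation q(x) in Q(L(G))."""
    return sum(r(q_x, q_w) for q_x in grammar_prons)


def confusion_probability(
    q_v: Pron, q_w: Pron, grammar_prons: Sequence[Pron], r: Raw
) -> float:
    """p(q(v)|q(w), G) = r(q(v)|q(w)) / Z(q(w), G)."""
    z = normalizer(q_w, grammar_prons, r)
    return r(q_v, q_w) / z if z > 0 else 0.0


# Invented toy values, for illustration only.
TOY_TABLE: Dict[Tuple[Pron, Pron], float] = {(("b",), ("a",)): 0.25}


def toy_r(q_v: Pron, q_w: Pron) -> float:
    """Toy raw measure: 1.0 for identical pronunciations, else a table lookup."""
    return 1.0 if q_v == q_w else TOY_TABLE.get((q_v, q_w), 0.0)


grammar = [("a",), ("b",)]
p_b_given_a = confusion_probability(("b",), ("a",), grammar, toy_r)  # 0.25 / 1.25
```

Because Z sums over every pronunciation the grammar admits, the probabilities for a fixed q(w) sum to 1 across the grammar, so adding a phrase to the grammar changes the confusion probabilities of all the others.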

Current Assignee: Promptu Systems Corporation
Original Assignee: Promptu Systems Corporation
Inventors: Chittar, Narren; Printz, Harry

Granted Patent: US 8,959,019 B2
US Class Current: 704/235
CPC Class Codes:

G06F 16/95   Retrieval from the web

G06F 16/9535   Search customisation based ...

G06Q 30/02   Marketing; Price estimation...

G10L 15/02   Feature extraction for spee...

G10L 15/142   Hidden Markov Models [HMMs]

G10L 15/18   using natural language mode...

G10L 15/187   Phonemic context, e.g. pron...

G10L 15/22   Procedures used during a sp...

G10L 17/26   Recognition of special voic...

G10L 2015/025   Phonemes, fenemes or fenone...
