Systems and methods for combining subword detection and word detection for processing a spoken input

US 20030110035A1
Filed: 12/12/2001
Published: 06/12/2003
Est. Priority Date: 12/12/2001
Status: Active Grant

Abstract

A computer-based detection (e.g., speech recognition) system combines a word decoder and subword decoder to detect words (or phrases) in a spoken input provided by a user into a speaker connected to the detection system. The word decoder detects words by comparing an input pattern (e.g., of hypothetical word matches) to reference patterns (e.g., words). The subword decoder compares an input pattern (e.g., hypothetical word matches based on subword or phoneme recognition) to reference patterns (e.g., words) based on a word pronunciation distance measure that indicates how close each input pattern is to matching each reference pattern. The word decoder and subword decoder each provide an N-best list of hypothetical matches to the spoken input. A list fusion module of the detection system selectively combines the two N-best lists to produce a final or combined N-best list.

28 Claims

1. A method for determining a plurality of hypothetical matches to a spoken input, comprising the computer-implemented steps of:
- detecting subword units in the spoken input to generate a first set of hypothetical matches to the spoken input;
  
  detecting words in the spoken input to generate a second set of hypothetical matches to the spoken input; and
  
  combining the first set of hypothetical matches with the second set of hypothetical matches to produce a combined set of hypothetical matches to the spoken input, the combined set having a predefined number of hypothetical matches.
- - 2. The method of claim 1, wherein the step of detecting subword units includes detecting the subword units in the spoken input based on an acoustic model of the subword units and a language model of the subword units;
    - generating pattern comparisons between (i) an input pattern corresponding to the subword units in the spoken input and (ii) a source set of reference patterns based on a pronunciation dictionary, each generated pattern comparison based on the input pattern and one of the reference patterns; and
      
      generating the first set of the hypothetical matches by sorting the source set of reference patterns based on a closeness of each reference pattern to correctly matching the input pattern based on an evaluation of each generated pattern comparison, each evaluation determining a word pronunciation distance measure that indicates how close each input pattern is to matching each reference pattern.
  - 3. The method of claim 1, wherein the combined set of hypothetical matches is an ordered list comprising a highest ranking hypothetical match in the second set of hypothetical matches, followed by an ordered set of hypothetical matches based on the first set of hypothetical matches.
  - 4. The method of claim 1, wherein the combined set of hypothetical matches is an ordered list based on ranking confidence levels for each hypothetical match.
  - 5. The method of claim 1, wherein the subword units include at least one phoneme.
  - 6. The method of claim 1, wherein the hypothetical matches are words.
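
The ordering recited in claim 3 can be illustrated with a minimal Python sketch. The function name `fuse_top_word_then_subword` and the choice to skip duplicate hypotheses are assumptions made for illustration; the claims do not specify how duplicates are handled.

```python
from typing import List


def fuse_top_word_then_subword(word_nbest: List[str],
                               subword_nbest: List[str],
                               n: int) -> List[str]:
    """Combine two ranked hypothesis lists into one list of at most n entries.

    The top word-decoder hypothesis comes first, followed by the
    subword-decoder hypotheses in their own order (as in claim 3).
    Skipping duplicates is an assumption of this sketch.
    """
    combined: List[str] = []
    if word_nbest:
        combined.append(word_nbest[0])
    for hyp in subword_nbest:
        if len(combined) >= n:
            break
        if hyp not in combined:
            combined.append(hyp)
    return combined[:n]
```

Here `n` plays the role of the predefined number of hypothetical matches in the combined set; the list may come out shorter than `n` if the two decoders together supply fewer distinct hypotheses.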

7. A computer system for determining a plurality of hypothetical matches to a spoken input, comprising:
- a subword decoder for detecting subword units in the spoken input to generate a first set of hypothetical matches to the spoken input;
  
  a word decoder detecting words in the spoken input to generate a second set of hypothetical matches to the spoken input; and
  
  a list fusion module for combining the first set of hypothetical matches with the second set of hypothetical matches to produce a combined set of hypothetical matches to the spoken input, the combined set having a predefined number of hypothetical matches.
- - 8. The computer system of claim 7, wherein the subword decoder detects the subword units in the spoken input based on an acoustic model of the subword units and a language model of the subword units;
    - generates pattern comparisons between (i) an input pattern corresponding to the subword units in the spoken input and (ii) a source set of reference patterns based on a pronunciation dictionary, each generated pattern comparison based on the input pattern and one of the reference patterns; and
      
      generates the first set of the hypothetical matches by sorting the source set of reference patterns based on a closeness of each reference pattern to correctly matching the input pattern based on an evaluation of each generated pattern comparison, each evaluation determining a word pronunciation distance measure that indicates how close each input pattern is to matching each reference pattern.
  - 9. The computer system of claim 7, wherein the combined set of hypothetical matches is an ordered list comprising a highest ranking hypothetical match in the second set of hypothetical matches, followed by an ordered set of hypothetical matches based on the first set of hypothetical matches.
  - 10. The computer system of claim 7, wherein the combined set of hypothetical matches is an ordered list based on ranking confidence levels for each hypothetical match.
  - 11. The computer system of claim 7, wherein the subword units include at least one phoneme.
  - 12. The computer system of claim 7, wherein the hypothetical matches are words.
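
Claims 4 and 10 describe a combined list ordered by ranking confidence levels. A minimal Python sketch of one way a list fusion module might do this follows. It assumes that both decoders report confidences on a comparable scale and that a word appearing in both lists keeps its higher confidence; neither assumption comes from the claims, and the name `fuse_by_confidence` is hypothetical.

```python
from typing import Dict, List, Tuple

Hypothesis = Tuple[str, float]  # (word, confidence); hypothetical representation


def fuse_by_confidence(word_nbest: List[Hypothesis],
                       subword_nbest: List[Hypothesis],
                       n: int) -> List[Hypothesis]:
    """Merge two N-best lists and keep the n most confident distinct words."""
    best: Dict[str, float] = {}
    for word, conf in list(word_nbest) + list(subword_nbest):
        # Assumption: a duplicate word keeps its highest confidence.
        if word not in best or conf > best[word]:
            best[word] = conf
    # Sort by descending confidence; break ties alphabetically for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```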

13. A computer program product comprising:
- a computer usable medium for determining a plurality of hypothetical matches to a spoken input; and
  
  a set of computer program instructions embodied on the computer usable medium, including instructions to:
  
  detect subword units in the spoken input to generate a first set of hypothetical matches to the spoken input;
  
  detect words in the spoken input to generate a second set of hypothetical matches to the spoken input; and
  
  combine the first set of hypothetical matches with the second set of hypothetical matches to produce a combined set of hypothetical matches to the spoken input, the combined set having a predefined number of hypothetical matches.

14. A method for determining a plurality of hypothetical matches to a spoken input by detecting subword units in the spoken input, comprising the computer-implemented steps of:
- detecting the subword units in the spoken input based on an acoustic model of the subword units and a language model of the subword units;
  
  generating pattern comparisons between (i) an input pattern corresponding to the subword units in the spoken input and (ii) a source set of reference patterns based on a pronunciation dictionary, each generated pattern comparison based on the input pattern and one of the reference patterns; and
  
  generating a set of the hypothetical matches by sorting the source set of reference patterns based on a closeness of each reference pattern to correctly matching the input pattern based on an evaluation of each generated pattern comparison, each evaluation determining a word pronunciation distance measure that indicates how close each input pattern is to matching each reference pattern.
- - 15. The method of claim 14, wherein the pattern comparisons are based on a confusion matrix that stores the likelihood of confusion between pairs of subword units, the likelihood of deleting each subword unit, and the likelihood of inserting each subword unit.
  - 16. The method of claim 15, further comprising a step of training the confusion matrix based on an output of a subword decoder, the output produced from an acoustic input of a training data set input to the subword decoder.
  - 17. The method of claim 15, further comprising the step of computing the confusion matrix by determining an entry in the confusion matrix for each unique subword unit that is in the set of reference patterns.
  - 18. The method of claim 14, wherein the step of detecting subword units is performed in a client computer, and the steps of generating pattern comparisons and generating the set of hypothetical matches are performed in a server computer.
  - 19. The method of claim 14, wherein the step of generating the set of hypothetical matches includes:
    - determining pairs of subword units by pairing an input subword unit from the input pattern with a reference subword unit from the reference pattern; and
      
      providing the word pronunciation distance measure by calculating a distance metric for each pair of subword units, the distance metric defined as follows:
      
      $\begin{matrix} S(p_{0}, d_{0}) = 0 \\ S(p_{i}, d_{j}) = \min \left\{ \begin{matrix} S(p_{i-1}, d_{j-1}) + C_{subs}(p_{i}, d_{j}) \\ S(p_{i-1}, d_{j}) + C_{del}(p_{i}) \\ S(p_{i}, d_{j-1}) + C_{ins}(d_{j}) \end{matrix} \right. \\ S(P, D) = S(p_{n}, d_{m}) + LP(p_{n}, d_{m}) \end{matrix}$
      
      wherein:
      
      $S(P, D)$ is a distance between word $P$ and $D$;
      
      $P$ is a given input pattern, and $D$, a given reference pattern;
      
      $S(p_{i}, d_{j})$ is a score of the given input pattern matching a given subword unit $p_{i}$ of $P$, and a given subword unit $d_{j}$ of $D$;
      
      $C_{subs}(p_{i}, d_{j})$ is a cost of substituting the given subword unit $p_{i}$ of $P$ with the given subword unit $d_{j}$ of $D$;
      
      $C_{del}(p_{i})$ is a cost of deleting the given subword unit $p_{i}$ of $P$;
      
      $C_{ins}(d_{j})$ is a cost of inserting the given subword unit $d_{j}$ of $D$;
      
      $LP(p_{n}, d_{m})$ is a length penalty of the given input pattern $p_{n}$ matching the given reference pattern $d_{m}$, $n$ is the length of $P$, and $m$ is the length of $D$;
      
      $S(p_{i-1}, d_{j-1})$ has a value of zero (0) if $p_{i-1}, d_{j-1}$ is undefined;
      
      $S(p_{i-1}, d_{j})$ has the value of zero (0) if $p_{i-1}, d_{j}$ is undefined;
      
      $S(p_{i}, d_{j-1})$ has the value of zero (0) if $p_{i}, d_{j-1}$ is undefined; and
      
      the distance metric for each pair of subword units is calculated in a sequence such that $S(p_{i-1}, d_{j-1})$, $S(p_{i-1}, d_{j})$, and $S(p_{i}, d_{j-1})$ are determined previously to determining $S(p_{i}, d_{j})$.
  - 20. The method of claim 14, wherein the subword units include at least one phoneme.
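
The distance metric recited in claim 19 has the shape of a weighted edit distance computed by dynamic programming. The Python sketch below follows the same three-way minimum over substitution, deletion, and insertion. Two points are assumptions of the sketch and not taken from the claims: it uses the conventional edit-distance boundary (cumulative deletion and insertion costs along the first row and column, with index 0 standing for the empty prefix) in place of the zero-valued undefined terms recited in the claim, and it treats the length penalty as a caller-supplied function of the two lengths that defaults to zero. The unit default costs are placeholders.

```python
from typing import Callable, Sequence


def pronunciation_distance(
    p: Sequence[str],
    d: Sequence[str],
    c_subs: Callable[[str, str], float] = lambda a, b: 0.0 if a == b else 1.0,
    c_del: Callable[[str], float] = lambda a: 1.0,
    c_ins: Callable[[str], float] = lambda b: 1.0,
    length_penalty: Callable[[int, int], float] = lambda n, m: 0.0,
) -> float:
    """Weighted edit distance between input pattern p and reference pattern d."""
    n, m = len(p), len(d)
    # s[i][j]: cost of aligning the first i units of p with the first j of d.
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: only deletions (first column) or insertions (first row).
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + c_del(p[i - 1])
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + c_ins(d[j - 1])
    # Row-major order guarantees the three predecessors are already computed.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = min(
                s[i - 1][j - 1] + c_subs(p[i - 1], d[j - 1]),  # substitute
                s[i - 1][j] + c_del(p[i - 1]),                 # delete from p
                s[i][j - 1] + c_ins(d[j - 1]),                 # insert from d
            )
    return s[n][m] + length_penalty(n, m)
```

The cost callables are the natural place to plug in values derived from a confusion matrix such as the one described in claims 15 to 17.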

21. A computer system for determining a plurality of hypothetical matches to a spoken input by detecting subword units in the spoken input, comprising:
- a subword decoder for detecting the subword units in the spoken input based on an acoustic model of the subword units and a language model of the subword units; and
  
  a subword detection vocabulary look up module for generating pattern comparisons between (i) an input pattern corresponding to the subword units in the spoken input and (ii) a source set of reference patterns based on a pronunciation dictionary, each generated pattern comparison based on the input pattern and one of the reference patterns;
  
  the subword detection vocabulary look up module generating a set of the hypothetical matches by sorting the source set of reference patterns based on a closeness of each reference pattern to correctly matching the input pattern based on an evaluation of each generated pattern comparison, each evaluation determining a word pronunciation distance measure that indicates how close each input pattern is to matching each reference pattern.
- - 22. The computer system of claim 21, wherein the pattern comparisons are based on a confusion matrix that stores the likelihood of confusion between pairs of subword units, the likelihood of deleting each subword unit, and the likelihood of inserting each subword unit.
  - 23. The computer system of claim 22, wherein the confusion matrix is trained on an output of the subword decoder, the output produced from an acoustic input of a training data set to the subword decoder.
  - 24. The computer system of claim 22, wherein the confusion matrix is based on determining an entry in the confusion matrix for each unique subword unit that is in the set of reference patterns.
  - 25. The computer system of claim 21, wherein the subword decoder is part of a client computer, and the subword detection vocabulary look up module is part of a server computer.
  - 26. The computer system of claim 21, wherein the subword detection vocabulary look up module determines pairs of subword units by pairing an input subword unit from the input pattern with a reference subword unit from the reference pattern;
    - and provides the word pronunciation distance measure by calculating a distance metric for each pair of subword units, the distance metric defined as follows:
      
      $\begin{matrix} S(p_{0}, d_{0}) = 0 \\ S(p_{i}, d_{j}) = \min \left\{ \begin{matrix} S(p_{i-1}, d_{j-1}) + C_{subs}(p_{i}, d_{j}) \\ S(p_{i-1}, d_{j}) + C_{del}(p_{i}) \\ S(p_{i}, d_{j-1}) + C_{ins}(d_{j}) \end{matrix} \right. \\ S(P, D) = S(p_{n}, d_{m}) + LP(p_{n}, d_{m}) \end{matrix}$
      
      wherein:
      
      $S(P, D)$ is a distance between word $P$ and $D$;
      
      $P$ is a given input pattern, and $D$, a given reference pattern;
      
      $S(p_{i}, d_{j})$ is a score of the given input pattern matching a given subword unit $p_{i}$ of $P$, and a given subword unit $d_{j}$ of $D$;
      
      $C_{subs}(p_{i}, d_{j})$ is a cost of substituting the given subword unit $p_{i}$ of $P$ with the given subword unit $d_{j}$ of $D$;
      
      $C_{del}(p_{i})$ is a cost of deleting the given subword unit $p_{i}$ of $P$;
      
      $C_{ins}(d_{j})$ is a cost of inserting the given subword unit $d_{j}$ of $D$;
      
      $LP(p_{n}, d_{m})$ is a length penalty of the given input pattern $p_{n}$ matching the given reference pattern $d_{m}$, $n$ is the length of $P$, and $m$ is the length of $D$;
      
      $S(p_{i-1}, d_{j-1})$ has a value of zero (0) if $p_{i-1}, d_{j-1}$ is undefined;
      
      $S(p_{i-1}, d_{j})$ has the value of zero (0) if $p_{i-1}, d_{j}$ is undefined;
      
      $S(p_{i}, d_{j-1})$ has the value of zero (0) if $p_{i}, d_{j-1}$ is undefined; and
      
      the distance metric for each pair of subword units is calculated in a sequence such that $S(p_{i-1}, d_{j-1})$, $S(p_{i-1}, d_{j})$, and $S(p_{i}, d_{j-1})$ are determined previously to determining $S(p_{i}, d_{j})$.
  - 27. The computer system of claim 21, wherein the subword units include at least one phoneme.
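
Claims 22 to 24 describe a confusion matrix holding substitution, deletion, and insertion likelihoods, trained on subword decoder output. The Python sketch below shows one plausible way to turn aligned decoder output into costs. It assumes the training data has already been aligned into (decoded unit, reference unit) pairs, with `None` marking a gap, and it converts smoothed relative frequencies into negative log costs. The alignment format, the smoothing scheme, and the name `train_confusion_costs` are all assumptions, not details from the claims.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# (decoded unit, reference unit); None marks a gap. Hypothetical format:
#   ("t", "d")   decoded "t" where the reference has "d" (substitution)
#   ("ah", None) decoded unit with no reference counterpart (deletion)
#   (None, "n")  reference unit missing from the decoded output (insertion)
Pair = Tuple[Optional[str], Optional[str]]


def train_confusion_costs(aligned_pairs: Iterable[Pair],
                          smoothing: float = 1.0) -> Tuple[Dict[Pair, float], float]:
    """Return (cost per observed event, cost assigned to any unseen event)."""
    if smoothing <= 0:
        raise ValueError("smoothing must be positive")
    counts = Counter(pair for pair in aligned_pairs if pair != (None, None))
    total = sum(counts.values())
    # Additive smoothing; the extra +1 reserves probability mass for unseen events.
    denom = total + smoothing * (len(counts) + 1)
    costs = {pair: -math.log((c + smoothing) / denom) for pair, c in counts.items()}
    unseen_cost = -math.log(smoothing / denom)
    return costs, unseen_cost
```

Frequent events (typically a unit decoded correctly) receive low costs, rare confusions receive higher costs, and events never seen in training receive the highest cost.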

28. A computer program product comprising:
- a computer usable medium for determining a plurality of hypothetical matches to a spoken input by detecting subwords in the spoken input; and
  
  a set of computer program instructions embodied on the computer usable medium, including instructions to:
  
  detect the subword units in the spoken input based on an acoustic model of the subword units and a language model of the subword units;
  
  generate pattern comparisons between (i) an input pattern corresponding to the subword units in the spoken input and (ii) a source set of reference patterns based on a pronunciation dictionary, each generated pattern comparison based on the input pattern and one of the reference patterns; and
  
  generate a set of the hypothetical matches by sorting the source set of reference patterns based on a closeness of each reference pattern to correctly matching the input pattern based on an evaluation of each generated pattern comparison, each evaluation determining a word pronunciation distance measure that indicates how close each input pattern is to matching each reference pattern.
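
The vocabulary look-up step in claims 14, 21, and 28 (compare the decoded subword sequence against every pronunciation in a dictionary, then sort by closeness) can be sketched in Python as follows. For brevity the sketch uses a plain unit-cost edit distance as a stand-in for the word pronunciation distance measure; the dictionary layout and the names `edit_distance` and `lookup_nbest` are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def edit_distance(p: Sequence[str], d: Sequence[str]) -> int:
    """Unit-cost edit distance; a stand-in for the claimed distance measure."""
    prev = list(range(len(d) + 1))
    for i, pu in enumerate(p, start=1):
        cur = [i]
        for j, du in enumerate(d, start=1):
            cur.append(min(prev[j - 1] + (pu != du),  # substitute or match
                           prev[j] + 1,               # delete from p
                           cur[j - 1] + 1))           # insert from d
        prev = cur
    return prev[-1]


def lookup_nbest(decoded: Sequence[str],
                 pronunciation_dict: Dict[str, Sequence[str]],
                 n: int) -> List[Tuple[str, int]]:
    """Return the n dictionary words whose pronunciations are closest to decoded."""
    scored = [(edit_distance(decoded, pron), word)
              for word, pron in pronunciation_dict.items()]
    # nsmallest orders by distance, then alphabetically by word on ties.
    return [(word, dist) for dist, word in heapq.nsmallest(n, scored)]
```

A small usage example with made-up phoneme strings:

```python
vocab = {
    "boston": ["b", "ao", "s", "t", "ah", "n"],
    "austin": ["ao", "s", "t", "ah", "n"],
    "houston": ["hh", "y", "uw", "s", "t", "ah", "n"],
}
print(lookup_nbest(["b", "ao", "s", "t", "ih", "n"], vocab, 2))
# [('boston', 1), ('austin', 2)]
```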

Current Assignee
Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee
Compaq Information Technologies Group LP (HP Inc.)
Inventors
Thong, Jean-Manuel Van; Pusateri, Ernest

Granted Patent

US 6,985,861 B2
US Class Current

704/254
CPC Class Codes

G10L 15/083 Recognition networks G10L15...

G10L 15/32 Multiple recognisers used i...
