Method and apparatus for large vocabulary continuous speech recognition using a hybrid phoneme-word lattice

US 8,831,947 B2
Filed: 11/07/2010
Issued: 09/09/2014
Est. Priority Date: 11/07/2010
Status: Active Grant



Abstract

A method and apparatus combining the advantages of phonetic search, such as rapid implementation and deployment and medium accuracy, comprising steps and components for receiving the audio signal captured in the call center environment, extracting a multiplicity of feature vectors from the audio signal, creating a phoneme lattice from the multiplicity of feature vectors, wherein the phoneme lattice comprises one or more allophones and each allophone comprises two or more phonemes, creating a hybrid phoneme-word lattice from the phoneme lattice, and extracting the word by analyzing the hybrid phoneme-word lattice.
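
The flow summarized in the abstract (phoneme lattice, then hybrid phoneme-word lattice, then word extraction) can be illustrated with a toy sketch. This is not the patented implementation: the edge tuples, the `LEXICON` pronunciations, the scores, and the function names `add_word_edges` and `best_word_path` are all hypothetical, single phoneme labels stand in for the multi-phoneme allophones the abstract describes, and the speech and non-speech models are left out entirely.

```python
# Toy phoneme lattice. Each edge is (start_node, end_node, label, log_prob).
# All labels and scores are invented for illustration only.
PHONEME_EDGES = [
    (0, 1, "k", -0.2), (0, 1, "g", -1.8),
    (1, 2, "ao", -0.3), (1, 2, "aa", -1.5),
    (2, 3, "l", -0.1),
    (3, 4, "s", -0.4), (3, 4, "z", -1.2),
    (4, 5, "eh", -0.3),
    (5, 6, "n", -0.2),
    (6, 7, "t", -0.5),
    (7, 8, "er", -0.3),
]

# Hypothetical pronunciation lexicon: word -> phoneme sequence.
LEXICON = {
    "call": ("k", "ao", "l"),
    "gall": ("g", "ao", "l"),
    "center": ("s", "eh", "n", "t", "er"),
}


def add_word_edges(phoneme_edges, lexicon):
    """Find lexicon words spelled by phoneme paths and return them as word edges."""
    outgoing = {}
    for start, end, label, log_prob in phoneme_edges:
        outgoing.setdefault(start, []).append((end, label, log_prob))

    def spell(node, phones, score):
        # Yield (end_node, score) for every path from `node` that spells `phones`.
        if not phones:
            yield node, score
            return
        for end, label, log_prob in outgoing.get(node, ()):
            if label == phones[0]:
                yield from spell(end, phones[1:], score + log_prob)

    best = {}
    for start in outgoing:
        for word, phones in lexicon.items():
            for end, score in spell(start, phones, 0.0):
                key = (start, end, word)
                if score > best.get(key, float("-inf")):
                    best[key] = score
    return [(s, e, w, lp) for (s, e, w), lp in best.items()]


def best_word_path(word_edges, start, final):
    """Best-scoring word sequence from `start` to `final`.

    Assumes node ids increase along every edge, so sorting edges by start
    node processes them in topological order.
    """
    best = {start: (0.0, [])}
    for s, e, word, log_prob in sorted(word_edges):
        if s in best:
            candidate = best[s][0] + log_prob
            if e not in best or candidate > best[e][0]:
                best[e] = (candidate, best[s][1] + [word])
    return best.get(final)


word_edges = add_word_edges(PHONEME_EDGES, LEXICON)
# The "hybrid" lattice here is simply phoneme edges and word edges side by side.
hybrid_lattice = PHONEME_EDGES + word_edges
best = best_word_path(word_edges, start=0, final=8)
print(best)
```

Keeping the phoneme edges alongside the word edges is what would let a later search fall back to phoneme matching for terms missing from the lexicon; the sketch does not implement that fallback.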


27 Claims

1. A method for extracting a term comprising at least one word from an audio signal captured in a call center environment, comprising:
- receiving the audio signal captured in an environment;
  
  extracting a multiplicity of vectors of spectrum-based features from the audio signal, wherein the spectrum-based features comprise at least any one of Mel Frequency Cepstral Coefficients (MFCC), Delta Cepstral Mel Frequency Coefficients (DMFCC), or spectral energy transform;
  
  creating a phoneme lattice from the multiplicity of feature vectors, the phoneme lattice comprising at least one allophone, the at least one allophone comprising at least two phonemes and determined as most probable and correspondingly assigned a probability score;
  
  creating a hybrid phoneme-word lattice from the phoneme lattice by utilizing a speech model and a non-speech model that were created from the audio signal captured in the call center environment; and
  
  extracting the word by analyzing the hybrid phoneme-word lattice.
  - 2. The method of claim 1 wherein creating a phoneme lattice comprises performing Viterbi decoding on the feature vectors.
  - 3. The method of claim 1 wherein the audio signal captured in the call center environment comprises non-speech events that include any one of a silence, music or tones.
  - 4. The method of claim 1 wherein the speech model and the non-speech model are created by a method comprising:
    - recognizing speech and non-speech segments within the audio inputs;
      
      estimating an initial speech model and an initial non-speech model;
      
      normalizing the initial speech model or the initial non-speech model into a speech model or a non-speech model; and
      
      adapting the speech model or the non-speech model.
  - 5. The method of claim 1 wherein creating a phoneme lattice utilizes a joint multigram statistic model.
  - 6. The method of claim 1 wherein creating the hybrid phoneme-word lattice comprises performing a word beam search or stack/A* decoding on the phoneme lattice.
  - 7. The method of claim 1 wherein creating the hybrid phoneme-word lattice utilizes a contextual word sequence model.
  - 8. The method of claim 7 wherein the contextual word sequence model is generated by a method comprising:
    - performing domain based large vocabulary speech recognition of audio input;
      
      performing a Smoothing-Turing/Backoff-Katz/Kneser-Ney estimation; and
      
      performing context adaptation.
  - 9. The method of claim 8 further comprising performing at least one step selected from the group consisting of:
    - web adaptation;
      
      unsupervised adaptation;
      
      word confidence estimation; and
      
      multi-pass decoding.
  - 10. The method of claim 1 wherein analyzing the hybrid phoneme-word lattice comprises at least one step selected from the group consisting of:
    - text retrieval;
      
      word search;
      
      out-of-vocabulary word search;
      
      evaluation;
      
      error correction;
      
      meta data extraction; and
      
      N-best selection.
  - 11. The method of claim 1, wherein the spectrum-based features are acoustic features.
  - 12. The method of claim 1, wherein the spectrum-based features are non-acoustic features.
  - 13. The method of claim 1, wherein the hybrid phoneme-word lattice is created based on decoding of words from the phonemes lattice.
  - 14. The method of claim 1, wherein the extraction of a multiplicity of vectors of spectrum-based features is based on overlapping and non-aligned time frames of the audio signal.
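
Claim 2 names Viterbi decoding as one way to create the phoneme lattice. The following is a minimal sketch of textbook Viterbi decoding over a discrete hidden Markov model, not the decoder described in the patent. The two states, the quantized "low"/"high" observations standing in for feature vectors, and all probabilities are assumptions made for illustration, and the sketch returns only the single best state path rather than a lattice of scored alternatives.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete HMM."""
    # trellis[t][s] = (best log prob of a path ending in state s at time t,
    #                  previous state on that path)
    trellis = [{
        s: (log_start[s] + log_emit[s][observations[0]], None) for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, trellis[-1][p][0] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + log_emit[s][obs], prev)
        trellis.append(column)

    # Backtrace from the best final state.
    last = max(states, key=lambda s: trellis[-1][s][0])
    best_score = trellis[-1][last][0]
    path = [last]
    for t in range(len(trellis) - 1, 0, -1):
        path.append(trellis[t][path[-1]][1])
    path.reverse()
    return best_score, path


def logs(table):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Hypothetical two-state model; every number below is invented.
STATES = ["sil", "speech"]
LOG_START = logs({"sil": 0.8, "speech": 0.2})
LOG_TRANS = {
    "sil": logs({"sil": 0.7, "speech": 0.3}),
    "speech": logs({"sil": 0.2, "speech": 0.8}),
}
LOG_EMIT = {
    "sil": logs({"low": 0.9, "high": 0.1}),
    "speech": logs({"low": 0.2, "high": 0.8}),
}

score, path = viterbi(
    ["low", "low", "high", "high", "low"], STATES, LOG_START, LOG_TRANS, LOG_EMIT
)
print(path)  # ['sil', 'sil', 'speech', 'speech', 'sil']
```

Working in log probabilities avoids numeric underflow on long observation sequences, which is why the tables are converted with `math.log` before decoding.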

15. An apparatus for extracting a term comprising an at least one word from an audio signal captured in a call center environment, comprising:
- a capture device for capturing the audio signal in an environment;
  
  a feature extraction component for extracting a multiplicity of vectors of spectrum-based features from the audio signal, wherein the spectrum-based features comprise at least any one of Mel Frequency Cepstral Coefficients (MFCC), Delta Cepstral Mel Frequency Coefficients (DMFCC), or spectral energy transform;
  
  an allophone decoding component for creating a phoneme lattice from the multiplicity of feature vectors, the phoneme lattice comprising at least one allophone, the at least one allophone comprising at least two phonemes and determined as most probable and correspondingly assigned a probability score;
  
  a word decoding component for creating a hybrid phoneme-word lattice from the phoneme lattice by utilizing a speech model and a non-speech model that were created from the audio signal captured in the call center environment; and
  
  an analysis component for analyzing the hybrid phoneme-word lattice.
  - 16. The apparatus of claim 15 wherein the allophone decoding component comprises a Viterbi decoder.
  - 17. The apparatus of claim 15 wherein the allophone decoding component receives a speech model and a non-speech model.
  - 18. The apparatus of claim 15 wherein the allophone decoding component receives a joint multigram statistic model.
  - 19. The apparatus of claim 15 wherein the word decoding component receives a contextual word sequence model.
  - 20. The apparatus of claim 15 wherein the word decoding component comprises a word beam search component or a stack/A* decoding component.
  - 21. The apparatus of claim 15 further comprising a storage device for storing the phoneme lattice or the hybrid phoneme-word lattice.
  - 22. The apparatus of claim 15 wherein the analysis component comprises at least one component selected from the group consisting of:
    - a text retrieval component;
      
      a word search component;
      
      an out-of-vocabulary word search component;
      
      an evaluation component;
      
      an error correction component;
      
      a meta data extraction component; and
      
      an N-best selection component.
  - 23. The apparatus of claim 15, wherein the hybrid phoneme-word lattice is created based on decoding of words from the phonemes lattice.
  - 24. The apparatus of claim 15, wherein the extraction of a multiplicity of vectors of spectrum-based features is based on overlapping and non-aligned time frames of the audio signal.
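
Claims 10 and 22 list N-best selection among the possible analysis steps. One common way to pull the N best-scoring paths out of a small acyclic lattice is a best-first search with a priority queue, sketched below. The edge format, the example words and scores, and the name `n_best_paths` are hypothetical; the claims do not say how the selection is performed.

```python
import heapq


def n_best_paths(edges, start, final, n):
    """Return up to n (log_prob, labels) paths from start to final, best first.

    Assumes an acyclic lattice whose edge scores are log probabilities
    (all <= 0), so the accumulated cost of a partial path never decreases.
    """
    outgoing = {}
    for s, e, label, log_prob in edges:
        outgoing.setdefault(s, []).append((e, label, log_prob))

    # Heap entries are (cost, node, labels) with cost = -(sum of log probs).
    heap = [(0.0, start, ())]
    results = []
    while heap and len(results) < n:
        cost, node, labels = heapq.heappop(heap)
        if node == final:
            results.append((-cost, list(labels)))
            continue
        for e, label, log_prob in outgoing.get(node, ()):
            heapq.heappush(heap, (cost - log_prob, e, labels + (label,)))
    return results


# Toy word lattice; words and scores are invented for illustration.
WORD_EDGES = [
    (0, 1, "call", -0.5), (0, 1, "gall", -2.0),
    (1, 2, "center", -1.0), (1, 2, "sender", -1.25),
]

top3 = n_best_paths(WORD_EDGES, start=0, final=2, n=3)
for log_prob, words in top3:
    print(log_prob, " ".join(words))
```

The search enumerates paths without merging states, which is fine for a toy lattice but grows quickly on large ones.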

25. A non-transitory computer readable storage medium containing a set of instructions for a general purpose computer, the set of instructions comprising:
- capturing an audio signal in a call center environment;
  
  extracting a multiplicity of vectors of spectrum-based features from the audio signal, wherein the spectrum-based features comprise at least any one of Mel Frequency Cepstral Coefficients (MFCC), Delta Cepstral Mel Frequency Coefficients (DMFCC), or spectral energy transform;
  
  creating a phoneme lattice from the multiplicity of feature vectors, the phoneme lattice comprising at least one allophone, the at least one allophone comprising at least two phonemes and determined as most probable and correspondingly assigned a probability score;
  
  creating a hybrid phoneme-word lattice from the phoneme lattice by utilizing a speech model and a non-speech model that were created from the audio signal captured in the call center environment; and
  
  analyzing the hybrid phoneme-word lattice.
  - 26. The non-transitory computer readable storage medium of claim 25, wherein the hybrid phoneme-word lattice is created based on decoding of words from the phonemes lattice.
  - 27. The non-transitory computer readable storage medium of claim 25, wherein the extraction of a multiplicity of vectors of spectrum-based features is based on overlapping and non-aligned time frames of the audio signal.
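
Claims 14, 24 and 27 base feature extraction on overlapping time frames of the audio signal. A minimal sketch of overlapping framing follows. The function name `frame_signal`, the 8 kHz sampling rate, and the 25 ms window with a 10 ms hop are assumptions reflecting a common convention rather than values from the claims, and the "non-aligned" aspect of the claim language is not modeled.

```python
def frame_signal(samples, frame_len, hop_len):
    """Split samples into frames of frame_len, advancing hop_len each time.

    Frames overlap whenever hop_len < frame_len. A trailing partial frame
    is dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, hop_len)
    ]


# Small example: frames of 4 samples with a hop of 2 overlap by 2 samples.
small_frames = frame_signal(list(range(10)), frame_len=4, hop_len=2)
print(small_frames)

# Assumed convention: 8 kHz audio, 25 ms frames (200 samples), 10 ms hop (80 samples).
one_second = [0.0] * 8000
speech_frames = frame_signal(one_second, frame_len=200, hop_len=80)
print(len(speech_frames))
```

Each frame would then be passed to a feature extractor, for example one producing MFCC vectors, which this sketch does not implement.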


Current Assignee: Nice Ltd
Original Assignee: Nice Systems Limited (Nice Ltd)
Inventors: Wasserblat, Moshe; Laperdon, Ronen; Shapira, Dori
Primary Examiner(s): PULLIAS, JESSE SCOTT
Application Number: US12/941,057
Publication Number: US 20120116766A1
Time in Patent Office: 1,402 Days
Field of Search: 704/231-257, 704/270, 704/270.1, 704E17001-E1505
US Class Current: 704/255
CPC Class Codes: G10L 15/08 Speech classification or se...
