System and method for computing and transmitting parameters in a distributed voice recognition system

US 20030004720A1
Filed: 01/28/2002
Published: 01/02/2003
Est. Priority Date: 01/30/2001
Status: Abandoned Application

First Claim

1. In a voice recognition system comprising a front end and a back end, a feature extraction module, comprising:

a processing sub-module; and

a feature extraction sub-module communicatively coupled to said processing sub-module;

wherein a digital signal provided from said processing sub-module is downsampled in a downsampling module.

2 Assignments

0 Petitions

Abstract

A system and method for extracting acoustic features and speech activity on a device and transmitting them in a distributed voice recognition system. The distributed voice recognition system includes a local VR engine in a subscriber unit and a server VR engine on a server. The local VR engine comprises a feature extraction (FE) module that extracts features from a speech signal, and a voice activity detection (VAD) module that detects voice activity within the speech signal. The system includes filters, framing and windowing modules, power spectrum analyzers, a neural network, a nonlinear element, and other components to selectively provide an advanced front end vector, including predetermined portions of the voice activity detection indication and extracted features, from the subscriber unit to the server. The system also includes a module that generates additional feature vectors on the server from the received features using a feed-forward multilayer perceptron (MLP) and provides them to the speech server.

95 Citations

109 Claims

1. In a voice recognition system comprising a front end and a back end, a feature extraction module, comprising:
- a processing sub-module; and
  
  a feature extraction sub-module communicatively coupled to said processing sub-module;
  
  wherein a digital signal provided from said processing sub-module is downsampled in a downsampling module.
- - 2. The voice recognition system as claimed in claim 1, wherein said downsampling module is disposed in said feature extraction sub-module.
  - 3. The voice recognition system as claimed in claim 2 further comprising:
    - a first filter module communicatively coupled to said processing sub-module and said downsampling module.
  - 4. The voice recognition system as claimed in claim 3 wherein said first filter module is configured to perform filtering in accordance with linear discriminant analysis.
  - 5. The voice recognition system as claimed in claim 2, further comprising:
    - a first transformation module communicatively coupled to said downsampling module; and
      
      a normalization module communicatively coupled to said first transformation module.
  - 6. The voice recognition system as claimed in claim 5 wherein said first transformation module is configured to perform discrete cosine transform.
  - 7. The voice recognition system as claimed in claim 5, further comprising:
    - a bitstream processor communicatively coupled to said normalization module.
  - 8. The voice recognition system as claimed in claim 7, further comprising:
    - a compressor module communicatively coupled to said normalization module and said bitstream processor.
  - 9. The voice recognition system as claimed in claim 1, wherein said processing sub-module comprises:
    - a framing module;
      
      a windowing module communicatively coupled to said framing module;
      
      a second transformation module communicatively coupled to said windowing module;
      
      a power spectrum module communicatively coupled to said transform module;
      
      a second filter module communicatively coupled to said power spectrum module; and
      
      a third transformation module communicatively coupled to said second filter module.
  - 10. The voice recognition system as claimed in claim 9, wherein said framing module is configured to:
    - accept speech signal; and
      
      provide a frame of the speech signal.
  - 11. The voice recognition system as claimed in claim 9, wherein said windowing module is configured to perform windowing by Hamming function.
  - 12. The voice recognition system as claimed in claim 9, wherein said second transformation module is configured to perform a Fourier transform.
  - 13. The voice recognition system as claimed in claim 9, wherein said power spectrum module is configured to perform a power spectrum determination.
  - 14. The voice recognition system as claimed in claim 9, wherein said second filter module is configured to perform a MEL filtering.
  - 15. The voice recognition system as claimed in claim 9, wherein said third transformation module is configured to perform a non-linear transformation.
  - 16. The voice recognition system as claimed in claim 15, wherein said non-linear transformation is logarithmic transformation.
  - 17. The voice recognition system as claimed in claim 1, wherein said feature extraction module is disposed in said front end.
  - 18. The voice recognition system as claimed in claim 17, wherein said front end is disposed in a subscriber terminal.
  - 20. The voice recognition system as claimed in claim 19, wherein said downsampling module is disposed in said voice activity detection sub-module.
  - 21. The voice recognition system as claimed in claim 20, further comprising:
    - a first transformation module communicatively coupled to said downsampling module;
      
      an estimation module communicatively coupled to said transformation module;
      
      a threshold detector communicatively coupled to said estimation module;
      
      a first filter module communicatively coupled to said threshold detector.
  - 22. The voice recognition system as claimed in claim 21 wherein said first transformation module is configured to perform discrete cosine transform.
  - 23. The voice recognition system as claimed in claim 21 wherein said estimation module comprises a neural network.
  - 24. The voice recognition system as claimed in claim 21 wherein said first filter module comprises a median filter module.
  - 25. The voice recognition system as claimed in claim 19, wherein said processing sub-module comprises:
    - a framing module;
      
      a windowing module communicatively coupled to said framing module;
      
      a second transformation module communicatively coupled to said windowing module;
      
      a power spectrum module communicatively coupled to said transform module;
      
      a second filter module communicatively coupled to said power spectrum module; and
      
      a third transformation module communicatively coupled to said second filter module.
  - 26. The voice recognition system as claimed in claim 25, wherein said framing module is configured to:
    - accept speech signal; and
      
      provide a frame of the speech signal.
  - 27. The voice recognition system as claimed in claim 25, wherein said windowing module is configured to perform windowing by a Hamming function.
  - 28. The voice recognition system as claimed in claim 25, wherein said second transformation module is configured to perform a Fourier transform.
  - 29. The voice recognition system as claimed in claim 25, wherein said power spectrum module is configured to perform a power spectrum determination.
  - 30. The voice recognition system as claimed in claim 25, wherein said second filter module is configured to perform a MEL filtering.
  - 31. The voice recognition system as claimed in claim 25, wherein said third transformation module is configured to perform a non-linear transformation.
  - 32. The voice recognition system as claimed in claim 31, wherein said non-linear transformation is logarithmic transformation.
  - 33. The voice recognition system as claimed in claim 19, wherein said voice activity detection module is disposed in said front end.
  - 34. The voice recognition system as claimed in claim 33, wherein said front end is disposed in a subscriber terminal.
  - 36. The voice recognition system as claimed in claim 35, wherein said first downsampling module is disposed in said feature extraction sub-module.
  - 37. The voice recognition system as claimed in claim 36 further comprising:
    - a first filter module communicatively coupled to said processing sub-module and said first downsampling module.
  - 38. The voice recognition system as claimed in claim 37 wherein said first filter module is configured to perform filtering in accordance with linear discriminant analysis.
  - 39. The voice recognition system as claimed in claim 36 further comprising:
    - a first transformation module communicatively coupled to said first downsampling module; and
      
      a normalization module communicatively coupled to said first transformation module.
  - 40. The voice recognition system as claimed in claim 39 wherein said first transformation module is configured to perform discrete cosine transform.
  - 41. The voice recognition system as claimed in claim 39, further comprising:
    - a bitstream processor communicatively coupled to said normalization module.
  - 42. The voice recognition system as claimed in claim 41, further comprising:
    - a compressor communicatively coupled to said normalization module and said bitstream processor.
  - 43. The voice recognition system as claimed in claim 35, wherein said second downsampling module is disposed in said voice activity detection sub-module.
  - 44. The voice recognition system as claimed in claim 43, further comprising:
    - a second transformation module communicatively coupled to said second downsampling module;
      
      an estimation module communicatively coupled to said second transformation module;
      
      a threshold detector communicatively coupled to said estimation module;
      
      a second filter module communicatively coupled to said threshold detector.
  - 45. The voice recognition system as claimed in claim 44 wherein said second transformation module is configured to perform discrete cosine transform.
  - 46. The voice recognition system as claimed in claim 44 wherein said estimation module comprises a neural network.
  - 47. The voice recognition system as claimed in claim 44 wherein said second filter module comprises a median filter module.
  - 48. The voice recognition system as claimed in claim 35, wherein said processing sub-module comprises:
    - a framing module;
      
      a windowing module communicatively coupled to said framing module;
      
      a third transformation module communicatively coupled to said windowing module;
      
      a power spectrum module communicatively coupled to said third transform module;
      
      a third filter module communicatively coupled to said power spectrum module; and
      
      a fourth transformation module communicatively coupled to said filtering module.
  - 49. The voice recognition system as claimed in claim 48, wherein said framing module is configured to:
    - accept speech signal; and
      
      provide a frame of the speech signal.
  - 50. The voice recognition system as claimed in claim 48, wherein said windowing module is configured to perform windowing by a Hamming function.
  - 51. The voice recognition system as claimed in claim 48, wherein said third transformation module is configured to perform a Fourier transform.
  - 52. The voice recognition system as claimed in claim 48, wherein said power spectrum module is configured to perform a power spectrum determination.
  - 53. The voice recognition system as claimed in claim 48, wherein said third filter module is configured to perform a MEL filtering.
  - 54. The voice recognition system as claimed in claim 48, wherein said fourth transformation module is configured to perform a non-linear transformation.
  - 55. The voice recognition system as claimed in claim 54, wherein said non-linear transformation is logarithmic transformation.
  - 56. The voice recognition system as claimed in claim 35, further comprising a transmitter communicatively coupled to:
    - said feature extraction module; and
      
      said voice activity module.
  - 57. The voice recognition system as claimed in claim 56, wherein said processing sub-module, said feature extraction module, said voice activity detection module and said transmitter are disposed in said front end.
  - 58. The voice recognition system as claimed in claim 57, wherein said front end is disposed in a subscriber terminal.
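
Claims 2, 5 and 6 recite downsampling followed by a discrete cosine transform and a normalization module. The following is a minimal Python sketch of that chain, not the applicants' implementation. The function names, the downsampling factor of 2, and the choice of per-coefficient mean and variance normalization are assumptions made for illustration; the claims do not specify the kind of normalization, and the linear discriminant analysis filter of claims 3 and 4 is omitted.

```python
import math

def decimate(frames, factor):
    """Keep every `factor`-th frame of the feature stream (assumed scheme)."""
    return frames[::factor]

def dct2(x):
    """Unnormalized DCT-II of one frame of values."""
    n = len(x)
    return [sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, v in enumerate(x))
            for k in range(n)]

def mean_variance_normalize(frames):
    """Shift each coefficient to zero mean and scale it to unit variance.

    Assumption: statistics are taken over all frames passed in. A
    coefficient with zero variance is left unscaled to avoid division by 0.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
            for d in range(dims)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]

def extract(frames, factor=2):
    """Downsample, transform each frame, then normalize across frames."""
    return mean_variance_normalize([dct2(f) for f in decimate(frames, factor)])
```

In this sketch `extract` would be called on the frames produced by the processing sub-module, for example one list of log filter-bank energies per frame.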

19. In a voice recognition system comprising a front end and a back end, a voice activity detection module, comprising:
- a processing sub-module; and
  
  a voice activity detection sub-module communicatively coupled to said processing sub-module;
  
  wherein a digital signal provided from said processing sub-module is downsampled in a downsampling module.

35. A voice recognition system comprising a front end and a back end, comprising:
- a processing sub-module;
  
  a feature extraction sub-module communicatively coupled to said processing sub-module, wherein a digital signal provided from said processing sub-module is downsampled in a first downsampling module; and
  
  a voice activity detection sub-module communicatively coupled to said processing sub-module, wherein the digital signal provided from said processing sub-module is downsampled in a second downsampling module.

59. A voice recognition system comprising a front end and a back end, comprising:
- a framing module;
  
  a windowing module communicatively coupled to said framing module;
  
  a first transformation module communicatively coupled to said windowing module;
  
  a power spectrum module communicatively coupled to said first transformation module;
  
  a first filtering module communicatively coupled to said power spectrum module;
  
  a second transformation module communicatively coupled to said first filtering module;
  
  a second filter module communicatively coupled to said second transformation module;
  
  a third filter module communicatively coupled to said second filter module;
  
  a first downsampling module communicatively coupled to said second filter module;
  
  a third transformation module communicatively coupled to said first downsampling module;
  
  a normalization module communicatively coupled to said third transformation module;
  
  a compressor module communicatively coupled to said normalization module;
  
  a bitstream processor communicatively coupled to said compressor module;
  
  a second downsampling module communicatively coupled to said second filter module;
  
  a fourth transformation module communicatively coupled to said second downsampling module;
  
  an estimation module communicatively coupled to said fourth transformation module;
  
  a threshold detector communicatively coupled to said estimation module;
  
  a fourth filter module communicatively coupled to said threshold detector.

60. A method for extracting at least one feature from a speech signal, comprising:
- processing a speech signal;
  
  downsampling said processed speech signal to provide a downsampled signal; and
  
  extracting the at least one feature from said downsampled signal.
- - 61. The method as claimed in claim 60 further comprising:
    - filtering said downsampled signal to provide a filtered signal; and
      
      wherein said extracting the at least one feature comprises extracting the at least one feature from said filtered signal.
  - 62. The method as claimed in claim 61 wherein said filtering said downsampled signal to provide a filtered signal comprises:
    - filtering in accordance with linear discriminant analysis.
  - 63. The method as claimed in claim 62, further comprising:
    - transforming said downsampled signal to provide transformed signal;
      
      normalizing said transformed signal.
  - 64. The method as claimed in claim 63 wherein said transforming said downsampled signal to provide transformed signal comprises:
    - transforming said downsampled signal by discrete cosine transform.
  - 65. The method as claimed in claim 63, further comprising:
    - processing said transformed signal to provide an output signal.
  - 66. The method as claimed in claim 65, further comprising:
    - compressing said transformed signal to provide a compressed signal; and
      
      wherein said processing comprises processing said compressed signal to provide an output signal.
  - 67. The method as claimed in claim 60 wherein said processing a speech signal comprises:
    - framing a speech signal to provide a frame of the speech signal;
      
      windowing said framed signal to provide windowed signal;
      
      transforming said windowed signal to provide transformed signal;
      
      determining a power spectrum of said transformed signal;
      
      filtering said determined power spectrum;
      
      transforming said filtered power spectrum.
  - 68. The method as claimed in claim 67, wherein said transforming said windowed signal comprises:
    - transforming said windowed signal by a Fourier transform.
  - 69. The method as claimed in claim 67, wherein said filtering said determined power spectrum comprises:
    - filtering said determined power spectrum by a MEL filter.
  - 70. The method as claimed in claim 67, wherein said transforming said filtered power spectrum comprises:
    - transforming said filtered power spectrum by a non-linear transformation.
  - 71. The method as claimed in claim 70, wherein said transforming said filtered power spectrum by a non-linear transformation comprises:
    - transforming said filtered power spectrum by a logarithmic transformation.
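
The processing steps of claims 67 to 71 (framing, windowing, Fourier transform, power spectrum, MEL filtering, logarithmic transformation) can be sketched in Python roughly as follows. This is an illustrative sketch under stated assumptions, not the method of the application: the 8 kHz sample rate, 200-sample frames, 80-sample step, 15 filters, the Hamming window (named in the system claims but not in claim 67), the triangular filter shape, the mel formula, and all function names are assumptions. A naive DFT is used to keep the example free of external libraries.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(signal, size, step):
    """Split a signal into overlapping frames of `size` samples."""
    return [signal[s:s + size] for s in range(0, len(signal) - size + 1, step)]

def power_spectrum(frame):
    """Squared magnitude of a naive DFT, for bins 0..N/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters with centres evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (sample_rate / 2) / (n_bins - 1)
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for b in range(n_bins):
            f = b * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def log_mel(signal, sample_rate=8000, size=200, step=80, n_filters=15):
    """Frame, window, transform, filter and take the log of each frame."""
    win = hamming(size)
    bank = mel_filterbank(n_filters, size // 2 + 1, sample_rate)
    out = []
    for fr in frames(signal, size, step):
        ps = power_spectrum([x * w for x, w in zip(fr, win)])
        # Small floor (an assumption) keeps log() defined for empty bands.
        out.append([math.log(sum(w * p for w, p in zip(row, ps)) + 1e-10)
                    for row in bank])
    return out
```

The downsampling and feature extraction of claim 60 would then operate on the list of per-frame vectors returned by `log_mel`.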

72. A method for voice activity detection, comprising:
- processing a speech signal;
  
  downsampling said processed speech signal to provide a downsampled signal; and
  
  detecting voice activity of said downsampled signal.
- - 73. The method as claimed in claim 72, further comprising:
    - transforming said downsampled signal to provide transformed signal;
      
      estimating probability of said downsampled signal being speech;
      
      applying a threshold to said estimation;
      
      filtering said estimation after said applying the threshold.
  - 74. The method as claimed in claim 73 wherein said transforming said downsampled signal to provide transformed signal comprises:
    - transforming said downsampled signal by discrete cosine transform.
  - 75. The method as claimed in claim 73 wherein said estimating probability of said downsampled signal being speech comprises:
    - estimating probability by a neural network.
  - 76. The method as claimed in claim 73 wherein said filtering said estimation comprises:
    - filtering said estimation by a median filter module.
  - 77. The method as claimed in claim 72 wherein said processing a speech signal comprises:
    - framing a speech signal to provide a frame of the speech signal;
      
      windowing said framed signal to provide windowed signal;
      
      transforming said windowed signal to provide transformed signal;
      
      determining a power spectrum of said transformed signal;
      
      filtering said determined power spectrum;
      
      transforming said filtered power spectrum.
  - 78. The method as claimed in claim 77, wherein said transforming said windowed signal comprises:
    - transforming said windowed signal by a Fourier transform.
  - 79. The method as claimed in claim 77, wherein said filtering said determined power spectrum comprises:
    - filtering said determined power spectrum by a MEL filter.
  - 80. The method as claimed in claim 77, wherein said transforming said filtered power spectrum comprises:
    - transforming said filtered power spectrum by a non-linear transformation.
  - 81. The method as claimed in claim 80, wherein said transforming said filtered power spectrum by a non-linear transformation comprises:
    - transforming said filtered power spectrum by a logarithmic transformation.
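
The detection steps of claims 73 to 76 (estimate the probability of speech, apply a threshold, then median-filter the decisions) can be illustrated with the short Python sketch below. It is a sketch under assumptions rather than the claimed detector: a single logistic unit with hypothetical `weights` and `bias` stands in for the neural network, the 0.5 threshold and window width are arbitrary, and the per-frame inputs are assumed to have already been downsampled and transformed.

```python
import math
from statistics import median

def speech_probability(frame_feats, weights, bias):
    """Stand-in estimator: one logistic unit, not a trained neural network."""
    z = sum(w * x for w, x in zip(weights, frame_feats)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def threshold(probs, level=0.5):
    """Turn probabilities into hard 0/1 speech decisions."""
    return [1 if p > level else 0 for p in probs]

def median_filter(flags, width=5):
    """Sliding median over an odd window, with the end values replicated."""
    if not flags:
        return []
    half = width // 2
    padded = [flags[0]] * half + list(flags) + [flags[-1]] * half
    return [median(padded[i:i + 2 * half + 1]) for i in range(len(flags))]

def detect_voice(frames_feats, weights, bias, level=0.5, width=5):
    """Estimate, threshold, then smooth the per-frame decisions."""
    probs = [speech_probability(f, weights, bias) for f in frames_feats]
    return median_filter(threshold(probs, level), width)
```

The median filter removes isolated decisions: with a width of 3, a single speech frame surrounded by silence is suppressed, while a run of speech frames with one dropout is kept whole.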

82. A method for determining speech signal characteristics, comprising:
- processing a speech signal;
  
  downsampling said processed speech signal by a first value to provide a first downsampled signal;
  
  extracting the at least one feature from said first downsampled signal;
  
  downsampling said processed speech signal by a second value to provide a second downsampled signal; and
  
  detecting voice activity from said second downsampled signal.
- - 83. The method as claimed in claim 82, wherein said downsampling said processed speech signal by a second value to provide a second downsampled signal comprises:
    - downsampling said processed speech signal by the first value to provide the first downsampled signal.
  - 84. The method as claimed in claim 82 further comprising:
    - filtering said first downsampled signal to provide a filtered signal; and
      
      wherein said extracting the at least one feature comprises extracting the at least one feature from said filtered signal.
  - 85. The method as claimed in claim 84 wherein said filtering said first downsampled signal to provide a filtered signal comprises:
    - filtering in accordance with linear discriminant analysis.
  - 86. The method as claimed in claim 84, further comprising:
    - transforming said first downsampled signal to provide transformed signal;
      
      normalizing said transformed signal.
  - 87. The method as claimed in claim 86 wherein said transforming said downsampled signal to provide transformed signal comprises:
    - transforming said first downsampled signal by discrete cosine transform.
  - 88. The method as claimed in claim 86, further comprising:
    - processing said transformed signal to provide an output signal.
  - 89. The method as claimed in claim 88, further comprising:
    - compressing said transformed signal to provide a compressed signal; and
      
      wherein said processing comprises processing said compressed signal to provide an output signal.
  - 90. The method as claimed in claim 82, further comprising:
    - transforming said second downsampled signal to provide transformed signal;
      
      estimating probability of said second downsampled signal being speech;
      
      applying a threshold to said estimation;
      
      filtering said estimation after applying the threshold.
  - 91. The method as claimed in claim 90 wherein said transforming said second downsampled signal to provide transformed signal comprises:
    - transforming said second downsampled signal by discrete cosine transform.
  - 92. The method as claimed in claim 90 wherein said estimating probability of said second downsampled signal being speech comprises:
    - estimating probability by a neural network.
  - 93. The method as claimed in claim 90 wherein said filtering said estimation after applying the threshold comprises:
    - filtering said estimation by a median filter module.
  - 94. The method as claimed in claim 82 wherein said processing a speech signal comprises:
    - framing a speech signal to provide a frame of the speech signal;
      
      windowing said framed signal to provide windowed signal;
      
      transforming said windowed signal to provide transformed signal;
      
      determining a power spectrum of said transformed signal;
      
      filtering said determined power spectrum;
      
      transforming said filtered power spectrum.
  - 95. The method as claimed in claim 94, wherein said transforming said windowed signal comprises:
    - transforming said windowed signal by a Fourier transform.
  - 96. The method as claimed in claim 94, wherein said filtering said determined power spectrum comprises:
    - filtering said determined power spectrum by a MEL filter.
  - 97. The method as claimed in claim 94, wherein said transforming said filtered power spectrum comprises:
    - transforming said filtered power spectrum by a non-linear transformation.
  - 98. The method as claimed in claim 97, wherein said transforming said filtered power spectrum by a non-linear transformation comprises:
    - transforming said filtered power spectrum by a logarithmic transformation.
  - 99. The method as claimed in claim 94, further comprising:
    - transmitting said extracted at least one feature and said detected voice activity.
  - 100. The method as claimed in claim 99, wherein said detected voice activity is transmitted ahead of said extracted at least one feature.
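
Claims 82, 99 and 100 describe two downsampled copies of the processed signal and a transmission in which the detected voice activity goes ahead of the extracted features. One possible reading of that ordering, as byte order within a single payload, is sketched below in Python. The payload layout (two 16-bit counts, one byte per voice activity flag, 32-bit floats for features) and the names `downsample`, `pack` and `unpack` are hypothetical; the application does not define a wire format here.

```python
import struct

def downsample(frames, factor):
    """Keep every `factor`-th frame (plain decimation, an assumed scheme)."""
    return frames[::factor]

def pack(vad_flags, features):
    """Serialize the voice activity flags first, then the feature values.

    Assumed big-endian layout: count of flags, count of features,
    one byte per flag, one 32-bit float per feature.
    """
    head = struct.pack(">HH", len(vad_flags), len(features))
    vad = bytes(vad_flags)
    body = struct.pack(f">{len(features)}f", *features)
    return head + vad + body

def unpack(payload):
    """Recover the flags and features written by pack()."""
    n_vad, n_feat = struct.unpack_from(">HH", payload, 0)
    vad = list(payload[4:4 + n_vad])
    feats = list(struct.unpack_from(f">{n_feat}f", payload, 4 + n_vad))
    return vad, feats
```

Calling `downsample` twice with different factors gives the first and second downsampled signals of claim 82; using the same factor for both corresponds to claim 83.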

101. A system for processing speech, comprising:
- a terminal feature extraction submodule for extracting at least one feature from the speech; and
  
  a terminal compression module for distinguishing the presence of voice activity from silence in the speech to determine voice activity data, compressing the at least one feature, and selectively combining and transmitting the at least one feature with selected voice activity data.
- - 102. The system of claim 101, further comprising:
    - a server decompression module for receiving and decompressing the selectively combined and transmitted at least one feature and selected voice activity data into decompression data;
      
      a server feature vector generator for generating a feature vector from the decompression data; and
      
      a speech recognition module for determining speech based on the feature vector.
  - 103. The system of claim 101, wherein the terminal compression module comprises a voice activity detection module.
  - 104. The system of claim 101, wherein the terminal feature extraction module and the terminal compression module reside on a subscriber unit.

105. A distributed voice recognition system for transmitting speech activity, comprising:
- a subscriber unit, comprising:
  
  a processing/feature extraction element receiving speech activity and converting the speech activity into features;
  
  a voice activity detector for detecting voice activity within said speech and providing at least one voice activity indication; and
  
  a processor for selectively combining the features with the at least one voice activity indication into advanced front end features; and
  
  a transmitter for transmitting the advanced front end features to a remote device.
- - 106. The distributed voice recognition system of claim 105, wherein said remote device comprises:
    - a receiver for receiving the advanced front end features;
      
      a word decoder for decoding the received information into words; and
      
      a transmitter for transmitting the decoded words to an appropriate subscriber unit.

107. A subscriber unit, comprising:
- means for extracting a plurality of features of a speech signal;
  
  means for detecting voice activity with the speech signal and providing an indication of the detected voice activity; and
  
  a transmitter coupled to the feature extraction means and the voice activity detection means and configured to selectively transmit indication of detected voice activity in selective combination with the plurality of features to a remote device.
- - 108. The subscriber unit of claim 107, further comprising a means for combining the plurality of features with the indication of detected voice activity, wherein the indication of detected voice activity is ahead of the plurality of features.

109. A system for generating feature vectors, comprising:
- a time derivative computation block for computing feature time derivatives;
  
  a feature concatenation block for combining feature time derivatives with features;
  
  a dual branch processor receiving data from said feature concatenation block, comprising:
  
  a first branch, comprising a multiple frame assembly module; and
  
  a second branch comprising a nonlinear transformation module and a dimensionality reduction and decorrelation module; and
  
  a processing concatenation block for concatenating data computed by said first branch and said second branch.
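
The data flow of claim 109 (time derivatives, concatenation with the features, two parallel branches, final concatenation) might be sketched in Python as follows. This is only an illustration under assumptions: central differences stand in for the derivative computation, `tanh` for the nonlinear transformation, and a caller-supplied hypothetical `projection` matrix for the dimensionality reduction and decorrelation module. None of these choices is taken from the application.

```python
import math

def deltas(frames):
    """Central-difference time derivative per coefficient, edges replicated."""
    n = len(frames)
    out = []
    for t in range(n):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def concat(a, b):
    """Join two frame sequences frame by frame."""
    return [x + y for x, y in zip(a, b)]

def assemble(frames, context=1):
    """First branch: stack each frame with `context` neighbours on each side."""
    n = len(frames)
    out = []
    for t in range(n):
        row = []
        for k in range(-context, context + 1):
            row += frames[min(max(t + k, 0), n - 1)]
        out.append(row)
    return out

def nonlinear_reduce(frames, projection):
    """Second branch: tanh (assumed), then a linear projection to fewer dims."""
    out = []
    for f in frames:
        h = [math.tanh(x) for x in f]
        out.append([sum(w * x for w, x in zip(row, h)) for row in projection])
    return out

def feature_vectors(frames, projection, context=1):
    """Derivatives, concatenation, both branches, final concatenation."""
    x = concat(frames, deltas(frames))
    return concat(assemble(x, context), nonlinear_reduce(x, projection))
```

Each row of `projection` must have as many entries as a frame after the derivatives are appended, that is, twice the input feature dimension.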

Current Assignee: International Computer Science Institute, Oregon Graduate Institute of Science and, Qualcomm, Inc.
Original Assignee: International Computer Science Institute, Oregon Graduate Institute of Science and, Qualcomm, Inc.
Inventors: Garudadri, Harinath; Sivadas, Sunil; Hermansky, Hynek; Kajarekar, Sachin; Dupont, Stephane N.; Ortuzar, Maria Carmen Benitez; Jain, Pratibha; Burget, Lukas; Morgan, Nelson H.
Application Number: US10/059,737
Publication Number: US 20030004720A1
US Class Current: 704/247
CPC Class Codes:
G10L 15/02 Feature extraction for spee...
G10L 15/30 Distributed recognition, e....
