Detection of end of utterance in speech recognition system

US 9,117,460 B2
Filed: 05/12/2004
Issued: 08/25/2015
Est. Priority Date: 05/12/2004
Status: Active Grant

Abstract

The present invention relates to speech recognition systems, and in particular to detecting the end of an utterance in such systems. A speech recognizer of the system is configured to determine whether the recognition result determined from received speech data has stabilized. The speech recognizer is configured to process values of best state scores and best token scores associated with frames of received speech data for end of utterance detection purposes. Further, if the recognition result has stabilized, the speech recognizer is configured to determine, based on that processing, whether end of utterance is detected or not.
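
The control flow summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `run_recognizer`, `FrameSummary`, and the two callbacks are hypothetical names, and each frame is assumed to arrive already reduced to its best state score, best token score, and current recognition result.

```python
from typing import Callable, Iterable, List, Optional, Sequence, Tuple

# Hypothetical per-frame summary: (best_state_score, best_token_score, result).
FrameSummary = Tuple[float, float, str]


def run_recognizer(
    frames: Iterable[FrameSummary],
    is_stabilized: Callable[[Sequence[str]], bool],
    end_of_utterance: Callable[[Sequence[float], Sequence[float]], bool],
) -> Optional[int]:
    """Return the index of the frame at which processing ends, or None."""
    state_scores: List[float] = []
    token_scores: List[float] = []
    results: List[str] = []
    for index, (state, token, result) in enumerate(frames):
        state_scores.append(state)
        token_scores.append(token)
        results.append(result)
        # The end-of-utterance test runs only once the result has stabilized.
        if not is_stabilized(results):
            continue
        if end_of_utterance(state_scores, token_scores):
            return index  # end the speech processing here
    return None  # input ran out before an end of utterance was detected
```

Both decisions are passed in as callbacks because the abstract does not fix how stabilization or the score test is computed; the dependent claims describe several options.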

36 Claims

1. A system comprising a speech recognizer with end of utterance detection, wherein
  the speech recognizer is configured to calculate values of state scores and token scores associated with frames of received speech data,
  the speech recognizer is configured to determine best state scores and best token scores, a best state score being a score of a state having the best probability amongst a number of states in a state model for speech recognition purposes, and a best token score being the best probability of a token amongst a number of tokens used for speech recognition purposes,
  the speech recognizer is configured to, at each received frame of received speech data, determine whether recognition result determined from received speech data is stabilized,
  if the recognition result determined from received speech data is not stabilized at a current frame, the speech recognizer is configured to continue speech processing for a next received speech frame and to calculate values of state scores and token scores and to determine the best state score and best token score for the next received speech frame,
  if the recognition result determined from speech data is stabilized at the current frame, the speech recognizer is configured to, in place of continuing speech processing for the next received frame, process values of the determined best state scores and best token scores associated with frames of received speech data for end of utterance detection purposes, and on the basis of the processed values of the best state scores and best token scores, whether end of utterance is detected or not,
  if the end of utterance is not detected on the basis of the processed values of the best state scores and best token scores, the speech recognizer is configured to continue speech processing for a next received speech frame and to calculate values of state scores and token scores and to determine the best state score and best token score for the next received speech frame, and
  if the end of utterance is detected on the basis of the processed values of the best state scores and best token scores, the speech recognizer is configured to end the speech processing.
  - 2. A system according to claim 1, wherein the speech recognizer is configured to calculate a best state score sum by summing the best state score values of a pre-determined number of frames, in response to the recognition result being stabilized, the speech recognizer is configured to compare the best state score sum to a predetermined threshold sum value, and the speech recognizer is configured to determine detection of end of utterance if the best state score sum does not exceed the threshold sum value.
  - 3. A system according to claim 2, wherein the speech recognizer is configured to normalize the best score sum by the number of detected silence models, and the speech recognizer is configured to compare the normalized best state score sum to the pre-determined threshold sum value.
  - 4. A system according to claim 2, wherein the speech recognizer is further configured to compare the number of best state score sums exceeding the threshold sum value to a predetermined minimum number value defining the required minimum number of best state score sums exceeding the threshold sum value, and the speech recognizer is configured to determine detection of end of utterance if the number of best state score sums exceeding the threshold sum value is the same or larger than the predetermined minimum number value.
  - 5. A system according to claim 1, wherein the speech recognizer is configured to wait a pre-determined time period before determining detection of end of utterance.
  - 6. A system according to claim 1, wherein the speech recognizer is configured to determine best token score values repetitively, the speech recognizer is configured to calculate the slope of the best token score values based on at least two best token score values, the speech recognizer is configured to compare the slope to a pre-determined threshold slope value, and the speech recognizer is configured to determine detection of end of utterance if the slope does not exceed the threshold slope value.
  - 7. A system according to claim 6, wherein the slope is calculated for each frame.
  - 8. A system according to claim 6, wherein the speech recognizer is further configured to compare the number of slopes exceeding the threshold slope value to a predetermined minimum number of slopes exceeding the threshold slope value, and the speech recognizer is configured to determine detection of end of utterance if the number of best state score sums exceeding the threshold slope value is the same or larger than the predetermined minimum number.
  - 9. A system according to claim 6, wherein the speech recognizer is configured to begin slope calculations only after a pre-determined number of frames has been received.
  - 10. A system according to claim 1, wherein the speech recognizer is configured to determine best token score of at least one inter-word token and best token score of an exit token, and the speech recognizer is configured to determine detection of end of utterance only if the best token score value of the exit token is higher than the best token score of the inter-word token.
  - 11. A system according to claim 1, wherein the speech recognizer is configured to determine detection of end of utterance only if the recognition result is not rejected.
  - 12. A system according to claim 1, wherein the speech recognizer is configured to determine detection of end of utterance after a maximum number of frames producing substantially the same recognition result has been received.
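
The score-sum test of claims 2 and 3 can be sketched as below. This is an illustrative reading only: the function names, the `window` and `silence_models` parameters, and the choice to return `False` until a full window of frames is available are assumptions, and the score scale is left to the caller. The counting variant of claim 4 is not covered.

```python
from typing import Sequence


def best_state_score_sum(
    best_state_scores: Sequence[float], window: int, silence_models: int = 1
) -> float:
    """Sum the best state scores of the last `window` frames.

    Dividing by `silence_models` mirrors the normalization of claim 3;
    the default of 1 leaves the sum unnormalized.
    """
    if window <= 0 or silence_models <= 0:
        raise ValueError("window and silence_models must be positive")
    return sum(best_state_scores[-window:]) / silence_models


def eou_by_score_sum(
    best_state_scores: Sequence[float],
    window: int,
    threshold: float,
    silence_models: int = 1,
) -> bool:
    """End of utterance if the sum does not exceed the threshold (claim 2)."""
    if len(best_state_scores) < window:
        return False  # assumption: wait until a full window has been seen
    total = best_state_score_sum(best_state_scores, window, silence_models)
    return total <= threshold
```

The caller is expected to invoke `eou_by_score_sum` only after the recognition result has stabilized, as the claims require.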

13. A method comprising:
- processing, in a data processing device, values of best state scores and best token scores associated with frames of received speech data for end of utterance detection purposes, the processing comprising;
  
  calculating values of state scores and token scores associated with frames of received speech data,
  determining best state scores and best token scores, a best state score being a score of a state having the best probability amongst a number of states in a state model for speech recognition purposes, and a best token score being the best probability of a token amongst a number of tokens used for speech recognition purposes,
  determining whether recognition result determined from received speech data is stabilized, and
  determining, in response to the recognition result being stabilized, on the basis of the processed values of the best state scores and best token scores, whether end of utterance is detected or not.
  - 14. A method according to claim 13, wherein a best state score sum is calculated by summing the best state score values of a pre-determined number of frames, in response to the recognition result being stabilized, the best state score sum is compared to a predetermined threshold sum value, and the detection of end of utterance is determined if the best state score sum does not exceed the threshold sum value.
  - 15. A method according to claim 13, wherein best token score values are determined repetitively, the slope of the best token score values is calculated based on at least two best token score values, the slope is compared to a pre-determined threshold slope value, and the detection of end of utterance is determined if the slope does not exceed the threshold slope value.
  - 16. A method according to claim 13, wherein best token score of at least one inter-word token and best token score of an exit token are determined, and the detection of end of utterance is determined only if the best token score value of the exit token is higher than the best token score of the inter-word token.
  - 17. A method according to claim 13, wherein the detection of end of utterance is determined only if the recognition result is not rejected.
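
The slope test of claim 15 can be sketched as follows. The claim only requires that the slope be based on at least two best token score values; the end-point difference used here, the `span` parameter, and the function names are assumptions made for illustration (a least-squares fit over the same values would be another valid choice).

```python
from typing import Optional, Sequence


def token_score_slope(
    best_token_scores: Sequence[float], span: int = 2
) -> Optional[float]:
    """Per-frame slope over the last `span` best token scores.

    Returns None while fewer than `span` values have been collected.
    """
    if span < 2:
        raise ValueError("span must cover at least two score values")
    if len(best_token_scores) < span:
        return None
    recent = best_token_scores[-span:]
    return (recent[-1] - recent[0]) / (span - 1)


def eou_by_slope(
    best_token_scores: Sequence[float], threshold_slope: float, span: int = 2
) -> bool:
    """End of utterance if the slope does not exceed the threshold."""
    slope = token_score_slope(best_token_scores, span)
    return slope is not None and slope <= threshold_slope
```

Returning `None` until enough values exist keeps the test from firing on the first frames, which is in the spirit of delaying slope calculations until a pre-determined number of frames has been received.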

18. An electronic device comprising a speech recognizer, wherein the speech recognizer is configured to determine whether recognition result determined from received speech data is stabilized,
the speech recognizer is configured to process values of best state scores and best token scores associated with frames of received speech data for end of utterance detection purposes, the processing comprising:
- calculating values of state scores and token scores associated with frames of received speech data,
  determining best state scores and best token scores, a best state score being a score of a state having the best probability amongst a number of states in a state model for speech recognition purposes, and a best token score being the best probability of a token amongst a number of tokens used for speech recognition purposes, and
  the speech recognizer is configured to determine, in response to the recognition result being stabilized, on the basis of the processed values of the best state scores and best token scores whether end of utterance is detected or not.
  - 19. An electronic device according to claim 18, wherein the speech recognizer is configured to calculate a best state score sum by summing the best state score values of a pre-determined number of frames, in response to the recognition result being stabilized, the speech recognizer is configured to compare the best state score sum to a predetermined threshold sum value, and the speech recognizer is configured to determine detection of end of utterance if the best state score sum does not exceed the threshold sum value.
  - 20. An electronic device according to claim 19, wherein the speech recognizer is configured to normalize the best score sum by the number of detected silence models, and the speech recognizer is configured to compare the normalized best state score sum to the pre-determined threshold sum value.
  - 21. An electronic device according to claim 19, wherein the speech recognizer is further configured to compare the number of best state score sums exceeding the threshold sum value to a predetermined minimum number value defining the required minimum number of best state score sums exceeding the threshold sum value, and the speech recognizer is configured to determine detection of end of utterance if the number of best state score sums exceeding the threshold sum value is the same or larger than the predetermined minimum number value.
  - 22. An electronic device according to claim 18, wherein the speech recognizer is configured to wait a pre-determined time period before determining detection of end of utterance.
  - 23. An electronic device according to claim 18, wherein the speech recognizer is configured to determine best token score values repetitively, the speech recognizer is configured to calculate the slope of the best token score values based on at least two best token score values, the speech recognizer is configured to compare the slope to a pre-determined threshold slope value, and the speech recognizer is configured to determine detection of end of utterance if the slope does not exceed the threshold slope value.
  - 24. An electronic device according to claim 23, wherein the slope is calculated for each frame.
  - 25. An electronic device according to claim 23, wherein the speech recognizer is further configured to compare the number of slopes exceeding the threshold slope value to a predetermined minimum number of slopes exceeding the threshold slope value, and the speech recognizer is configured to determine detection of end of utterance if the number of best state score sums exceeding the threshold slope value is the same or larger than the predetermined minimum number.
  - 26. An electronic device according to claim 23, wherein the speech recognizer is configured to begin slope calculations only after a pre-determined number of frames has been received.
  - 27. An electronic device according to claim 18, wherein the speech recognizer is configured to determine best token score of at least one inter-word token and best token score of an exit token, and the speech recognizer is configured to determine detection of end of utterance only if the best token score value of the exit token is higher than the best token score of the inter-word token.
  - 28. An electronic device according to claim 18, wherein the speech recognizer is configured to determine detection of end of utterance only if the recognition result is not rejected.
  - 29. An electronic device according to claim 18, wherein the speech recognizer is configured to determine detection of end of utterance after a maximum number of frames producing substantially the same recognition result has been received.
  - 30. An electronic device according to claim 18, wherein the electronic device is a mobile phone or a PDA device.
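
The two gating conditions of claims 27 and 28 can be combined into one small predicate, sketched below. The function and parameter names are hypothetical, and the sketch assumes that a numerically larger token score means a more likely token (for example, log-probabilities); the claims only say "higher".

```python
from typing import Sequence


def end_of_utterance_allowed(
    exit_token_score: float,
    inter_word_token_scores: Sequence[float],
    result_rejected: bool,
) -> bool:
    """Gate an end-of-utterance decision on token scores and rejection.

    Detection is allowed only if the recognition result was not rejected
    and the exit token scores higher than every inter-word token.
    """
    if not inter_word_token_scores:
        raise ValueError("at least one inter-word token score is required")
    if result_rejected:
        return False
    return exit_token_score > max(inter_word_token_scores)
```

Such a gate would be applied in addition to a score-based test, since both claims use "only if" rather than "if".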

31. A non-transitory computer readable medium encoded with a computer program, loadable into the memory of a data processing device, the computer program comprising:
- program code for processing values of best state scores and best token scores associated with frames of received speech data for end of utterance detection purposes, the processing comprising
  calculating values of state scores and token scores associated with frames of received speech data,
  determining best state scores and best token scores, a best state score being a score of a state having the best probability amongst a number of states in a state model for speech recognition purposes, and a best token score being the best probability of a token amongst a number of tokens used for speech recognition purposes,
  program code for determining whether recognition result determined from received speech data is stabilized, and
  program code for determining, in response to the recognition result being stabilized, on the basis of the processed values of the best state scores and best token scores, whether end of utterance is detected or not.
  - 32. A non-transitory computer readable medium according to claim 31, wherein at least part of the medium comprises a circuit or a memory.

33. An apparatus comprising a processor and a memory, the apparatus being configured to:
- receive frames of speech data;
  
  determine whether recognition result determined from the received speech data is stabilized;
  
  process values of best state scores and best token scores associated with frames of received speech data for end of utterance detection purposes, the process comprising
  calculating values of state scores and token scores associated with frames of received speech data,
  determining best state scores and best token scores, a best state score being a score of a state having the best probability amongst a number of states in a state model for speech recognition purposes, and a best token score being the best probability of a token amongst a number of tokens used for speech recognition purposes; and
  
  determine, in response to the recognition result being stabilized, on the basis of the processed values of the best state scores and best token scores, whether end of utterance is detected or not.
  - 34. An apparatus according to claim 33, where at least part of the apparatus comprises a circuit.
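
The claims do not define how a recognition result is judged to be stabilized. One plausible reading, offered purely as an assumption, is that the result counts as stabilized once it has stayed unchanged for a fixed number of consecutive frames. `StabilityTracker` and `required_frames` below are hypothetical names for that reading.

```python
from typing import Optional


class StabilityTracker:
    """Track how long the per-frame recognition result has been unchanged."""

    def __init__(self, required_frames: int) -> None:
        if required_frames < 1:
            raise ValueError("required_frames must be at least 1")
        self.required_frames = required_frames
        self._last: Optional[str] = None
        self._run = 0  # consecutive frames with the same result

    def update(self, result: str) -> bool:
        """Record one frame's result; return True once it is stable."""
        if result == self._last:
            self._run += 1
        else:
            self._last = result
            self._run = 1
        return self._run >= self.required_frames
```

A change in the result resets the run length, so a late correction by the recognizer postpones any end-of-utterance test.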

35. An apparatus comprising:
- means for receiving frames of speech data;
  
  means for determining whether a recognition result determined from the received speech data is stabilized;
  
  means for processing values of best state scores and best token scores associated with frames of received speech data for end of utterance detection purposes, the processing comprising
  means for calculating values of state scores and token scores associated with frames of received speech data,
  means for determining best state scores and best token scores, a best state score being a score of a state having the best probability amongst a number of states in a state model for speech recognition purposes, and a best token score being the best probability of a token amongst a number of tokens used for speech recognition purposes; and
  
  means for determining, in response to the recognition result being stabilized, on the basis of the processed values of the best state scores and best token scores, whether end of utterance is detected or not.
  - 36. An apparatus according to claim 35, further comprising:
    - means for calculating a best state score sum by summing the best state score values of a pre-determined number of frames,
      means for comparing the best state score sum to a predetermined threshold sum value in response to the recognition result being stabilized, and
      means for determining detection of end of utterance if the best state score sum does not exceed the threshold sum value.

Current Assignee: Conversant Wireless Licensing S.à.r.l. (f/k/a Core Wireless Licensing S.a.r.l.) (MOSAID Technologies Inc. (f/k/a Conversant Intellectual Property Management))
Original Assignee: Conversant Wireless Licensing S.à.r.l. (f/k/a Core Wireless Licensing S.a.r.l.) (MOSAID Technologies Inc. (f/k/a Conversant Intellectual Property Management))
Inventors: Lahti, Tommi
Primary Examiner(s): ADESANYA, OLUJIMI A
Application Number: US10/844,211
Publication Number: US 20050256711A1
Time in Patent Office: 4,122 Days
Field of Search: 704/251, 704/252, 704/253, 704/255, 704/256, 704/258
US Class Current: 1/1
CPC Class Codes: G10L 25/87 Detection of discrete point...
