Detecting emotions using voice signal analysis

US 20020194002A1
Filed: 07/12/2002
Published: 12/19/2002
Est. Priority Date: 08/31/1999
Status: Active Grant

First Claim

Patent Images

1. A method of detecting an emotional state, the method comprising:

providing a speech signal;

dividing the speech signal into at least one of segments, frames, and subframes;

extracting at least one acoustic feature from the speech signal;

calculating statistics from the at least one acoustic feature;

classifying the speech with at least one neural network classifier as belonging to at least one emotional state; and

storing in memory and outputting in a human-recognizable format an indication of the at least one emotional state, wherein the speech is classified by a classifier taught to recognize at least one emotional state from a finite number of emotional states.

View all claims

3 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A system and method are provided for detecting emotional states using statistics. First, a speech signal is received. At least one acoustic parameter is extracted from the speech signal. Then statistics or features from samples of the voice are calculated from extracted speech parameters. The features serve as inputs to a classifier, which can be a computer program, a device or both. The classifier assigns at least one emotional state from a finite number of possible emotional states to the speech signal. The classifier also estimates the confidence of its decision. Features that are calculated may include a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, and a variety of other statistics.

331 Citations

59 Claims

1. A method of detecting an emotional state, the method comprising:
- providing a speech signal;
  
  dividing the speech signal into at least one of segments, frames, and subframes;
  
  extracting at least one acoustic feature from the speech signal;
  
  calculating statistics from the at least one acoustic feature;
  
  classifying the speech with at least one neural network classifier as belonging to at least one emotional state; and
  
  storing in memory and outputting in a human-recognizable format an indication of the at least one emotional state, wherein the speech is classified by a classifier taught to recognize at least one emotional state from a finite number of emotional states.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59)
- - 2. The method of claim 1, wherein the statistic is selected from the group consisting of a fundamental frequency, a duration of the speech, an energy of the speech, and formants of the speech.
  - 3. The method of claim 1, wherein the statistics are calculated for at least one of a segment, a frame and a subframe, and the statistics are selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, and a speaking rate.
  - 4. The method of claim 1, wherein the statistics are calculated for at least one of a segment, a frame and a subframe, and the statistics are selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, and a maximum of a first formant.
  - 5. The method of claim 1, wherein the statistics are calculated for at least one of a segment, a frame and a subframe, and the statistics are selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, a maximum of a first formant, an energy maximum, an energy range, a first formant range, and a second formant range.
  - 6. The method of claim 1, wherein the speech is classified as emotional or non-emotional.
  - 7. The method of claim 1, wherein the indication is a probability of at least one emotion.
  - 8. The method of claim 1, wherein the indication is a statistic of at least one feature of a voice signal.
  - 9. The method of claim 8, wherein the statistic is selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, a maximum of a first formant, an energy maximum, an energy range, a first formant range, a second formant range, a percentage of silent time of at least one segment of the voice signal, a length of the voice signal, and a distribution of emotions in the voice signal.
  - 10. The method of claim 1, wherein the speech is classified as at least one of angry, sad, happy, afraid and neutral.
  - 11. The method of claim 1, wherein the speech is divided into segments of from about 1 second to about 3 seconds long.
  - 12. The method of claim 1, wherein the at least one neural network is taught to recognize an emotional state by dividing speech samples into training and testing segments, and wherein an algorithm for recognizing an emotional state is adjusted by comparing a classification from the neural network to a classification by at least one person.
  - 13. The method of claim 1, further comprising dividing the speech signal into at least one of segments from about 1 and about 3 seconds long, frames from about 20 to about 40 milliseconds long, and subframes from about 10 to about 20 milliseconds long.
  - 14. The method of claim 1, further comprising providing a predetermined response to a person voicing the speech signal displaying a particular emotional state.
  - 15. The method of claim 1, further comprising routing a call containing said speech signal to a predetermined location according to the at least one classified emotional state.
  - 16. The method of claim 1, wherein the indication is routed to at least one of a video display device, an audio output device, a computer, a printer and a alarm.
  - 17. The method of claim 1, further comprising routing a call containing the speech signal to a location selected from the group consisting of a voice-mail center, a call center, an e-mail destination, a customer service center, a manager, and emergency response personnel.
  - 19. The system of claim 18, wherein the logic for dividing the speech signal divides the speech signal into at least one of frames, subframes and segments.
  - 20. The system of claim 18, wherein the logic for extracting features extracts a feature selected from the group consisting of a fundamental frequency, an energy of the speech, a duration of the speech, a pausing rate, and formants of the speech.
  - 21. The system of claim 18, wherein the logic for a least one neural network comprises at least one three-layer neural network.
  - 22. The system of claim 18, wherein the statistics that are calculated are selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, and a speaking rate.
  - 23. The system of claim 18, wherein the statistics that are calculated are selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, and a maximum of a first formant.
  - 24. The system of claim 18, wherein the statistics that are calculated are selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, a maximum of a first formant, an energy maximum, an energy range, a first formant range, and a second formant range.
  - 25. The system of claim 18, wherein the logic classifies the speech as one of emotional and non-emotional.
  - 26. The system of claim 18, wherein the logic classifies the speech as calm or non-calm, wherein non-calm is defined as one of angry, happy and afraid.
  - 27. The system of claim 18, wherein the logic classifies the speech as at least one of angry, sad, happy, afraid and normal.
  - 28. The system of claim 18, further comprising logic for providing a predetermined response to a person voicing the speech signal displaying a particular emotional state.
  - 29. The system of claim 18, further comprising logic for routing a call containing said speech signal to a predetermined location according to the at least one classified emotional state.
  - 30. The system of claim 18, wherein the indication is a probability of at least one emotion.
  - 31. The system of claim 18, wherein the indication is a statistic of at least one feature of a voice signal.
  - 32. The system of claim 31, wherein the statistic is selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, a maximum of a first formant, an energy maximum, an energy range, a first formant range, a second formant range, a percentage of silent time of at least one segment of the voice signal, a length of the voice signal, and a distribution of emotions in the voice signal.
  - 33. The system of claim 18, wherein the logic routes the indication to at least one of a video display device, an audio output device, a computer, a printer and a alarm.
  - 34. The system of claim 18, further comprising logic for routing a call containing the speech signal to a location selected from the group consisting of a voice-mail center, a call center, an e-mail destination, a customer service center, a manager, and emergency response personnel.
  - 36. The method of claim 35, wherein the segments are from about 1 to about 3 seconds long, the subframes are from about 10 to 20 milliseconds long, and the frames are about 20 to about 40 milliseconds long.
  - 37. The method of claim 35, wherein the acoustic features are selected from the group consisting of a fundamental pitch, a duration of the speech, an energy of the speech, and formants of the speech.
  - 38. The method of claim 35, wherein the at least one neural network is a plurality of neural networks, each neural network trained differently, and the emotional state is classified by a majority vote of the plurality of neural networks.
  - 39. The method of claim 35, further comprising providing a predetermined response to a person voicing the speech signal displaying a particular emotional state.
  - 40. The method of claim 35, further comprising routing a call containing said speech signal to a predetermined location according to the at least one classified emotional state.
  - 41. The method of claim 35, wherein the indication is a probability of at least one emotion.
  - 42. The method of claim 35, wherein the indication is a statistic of at least one feature of a voice signal.
  - 43. The method of claim 42, wherein the statistic is selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, a maximum of a first formant, an energy maximum, an energy range, a first formant range, a second formant range, a percentage of silent time of at least one segment of the voice signal, a length of the voice signal, and a distribution of emotions in the voice signal.
  - 44. The method of claim 35, further comprising routing the indication to at least one of a video display device, an audio output device, a computer, a printer and a alarm.
  - 45. The method of claim 35, further comprising routing a call containing the speech signal to a location selected from the group consisting of a voice-mail center, a call center, an e-mail destination, a customer service center, a manager, and emergency response personnel.
  - 47. The system of claim 46, wherein the at least one computer is a digital signal processor.
  - 48. The system of claim 46, wherein the at least one neural network comprises at least one three-layer neural network.
  - 49. The system of claim 46, wherein the computer program also divides the voice signal into samples selected from the group consisting of frames and subframes.
  - 50. The system of claim 46, wherein the statistics used for comparison are selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, a maximum of a first formant, an energy maximum, an energy range, a first formant range, and a second formant range.
  - 51. The system of claim 46, further comprising logic for routing a call containing said voice signal in accordance with the emotional state of the speech.
  - 52. The system of claim 46, further comprising a set of predetermined responses to persons voicing a speech signal displaying a particular emotional state.
  - 53. The system of claim 46, further comprising logic for providing a predetermined response to a person voicing the speech signal displaying a particular emotional state.
  - 54. The system of claim 46, further comprising logic for routing a call containing said speech signal to a predetermined location according to the at least one classified emotional state.
  - 55. The system of claim 46, wherein the indication is a probability of at least one emotion.
  - 56. The system of claim 46, wherein the indication is a statistic of at least one feature of a voice signal.
  - 57. The system of claim 56, wherein the statistic is selected from the group consisting of a maximum value of a fundamental frequency, a standard deviation of the fundamental frequency, a range of the fundamental frequency, a mean of the fundamental frequency, a mean of a bandwidth of a first formant, a mean of a bandwidth of a second formant, an energy standard deviation, a speaking rate, a slope of a fundamental frequency, a maximum of a first formant, an energy maximum, an energy range, a first formant range, a second formant range, a percentage of silent time of at least one segment of the voice signal, a length of the voice signal, and a distribution of emotions in the voice signal.
  - 58. The system of claim 46, wherein the logic routes the indication to at least one of a video display device, an audio output device, a computer, a printer and a alarm.
  - 59. The system of claim 46, further comprising logic for routing a call containing the speech signal to a location selected from the group consisting of a voice-mail center, a call center, an e-mail destination, a customer service center, a manager, and emergency response personnel.

18. A system for classifying speech, the system comprising:
- a computer system comprising a central processing unit, an input device, at least one memory for storing data indicative of a speech signal, and an output device;
  
  logic for receiving and analyzing a speech signal;
  
  logic for dividing the speech signal;
  
  logic for extracting at least one feature from the speech signal;
  
  logic for calculating statistics of the speech;
  
  logic for at least one neural network for classifying the speech as belonging to at least one of a finite number of emotional states; and
  
  logic for storing in memory and outputting an indication of the at least one emotional state.

35. A method of recognizing emotional states in a voice, the method comprising:
- providing a first plurality and a second plurality of voice samples;
  
  identifying each sample of said pluralities of samples as belonging to a predominant emotional state;
  
  dividing each sample into at least one of frames, subframes, and segments;
  
  extracting at least one acoustic feature for each sample of the pluralities of samples;
  
  calculating statistics of the speech samples from the at least one feature;
  
  classifying an emotional state in the first plurality of samples with at least one neural network;
  
  training the at least one neural network to recognize an emotional state from the statistics by comparing the results of identifying and recognizing for the first plurality of samples;
  
  classifying an emotion in the second plurality of voice samples with the at least one trained neural network; and
  
  storing in memory and outputting in a human-recognizable format an indication of the emotional state.

46. A system for detecting an emotional state in a voice signal, the system comprising:
- a speech reception device;
  
  at least one computer connected to the speech reception device;
  
  at least one memory operably connected to the at least one computer;
  
  a computer program including at least one neural network for dividing the voice signal into a plurality of segments, and for analyzing the segments according to features of the segments to detect the emotional state in the voice signal;
  
  a database of speech signal features and statistics accessible to the computer for comparison with features of the voice signal; and
  
  an output device coupled to the computer for notifying a user of the emotional state detected in the voice signal.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Accenture Global Services Limited (Accenture PLC)
Original Assignee
Accenture LLP (Accenture PLC)
Inventors
Petrushin, Valery A.

Granted Patent

US 7,222,075 B2
Time in Patent Office

Days
Field of Search
US Class Current

704/270
CPC Class Codes

G10L 17/26   Recognition of special voic...

G10L 25/30   using neural networks

H04M 3/436   Arrangements for screening ...

H04M 3/533   Voice mail systems

Detecting emotions using voice signal analysis

First Claim

3 Assignments

0 Petitions

Accused Products

Abstract

331 Citations

59 Claims

Specification

Use Cases

Quick Links

Others

Detecting emotions using voice signal analysis

First Claim

3 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

331 Citations

59 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others