Emotion recognition in video conferencing
First Claim
1. A method comprising:
receiving a video including a sequence of images and an audio stream from a video conference between a first user interface and a second user interface;
detecting a face of an individual in one or more of the images;
recognizing a speech emotion in the audio stream;
generating a communication bearing data associated with the speech emotion;
transmitting the communication bearing data over a communications network; and
switching the video conference from between the first user interface and the second user interface to between the first user interface and a third user interface responsive to the communication bearing data.
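The claimed steps form a linear pipeline. A minimal sketch is shown below; the function and parameter names (`run_claimed_steps`, `detect_face`, `recognize_speech_emotion`, the 0.8 threshold, and the set of negative emotions) are illustrative assumptions, not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class EmotionSignal:
    label: str         # e.g. "angry", "neutral"
    confidence: float  # 0.0 .. 1.0

# Hypothetical: the patent names angry, annoyed, and distressed customers.
NEGATIVE_EMOTIONS = {"angry", "annoyed", "distressed"}

def run_claimed_steps(
    images: Sequence[object],
    audio: bytes,
    detect_face: Callable[[object], object],
    recognize_speech_emotion: Callable[[bytes], EmotionSignal],
    transmit: Callable[[dict], None],
    switch_to_third_ui: Callable[[], None],
    threshold: float = 0.8,
) -> dict:
    """One pass over the claimed method steps, with pluggable components."""
    # Detect a face of an individual in one or more of the images.
    faces = [detect_face(img) for img in images]
    # Recognize a speech emotion in the audio stream.
    emotion = recognize_speech_emotion(audio)
    # Generate a communication bearing data associated with the speech emotion.
    message = {
        "emotion": emotion.label,
        "confidence": emotion.confidence,
        "faces_detected": sum(f is not None for f in faces),
    }
    # Transmit the communication over the communications network.
    transmit(message)
    # Switch the conference responsive to the communication bearing data.
    if emotion.label in NEGATIVE_EMOTIONS and emotion.confidence >= threshold:
        switch_to_third_ui()
    return message
```

In practice the detector, recognizer, transmit, and switch callables would wrap real vision, speech, and signaling components; the sketch only fixes the order of the claimed steps.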
Abstract
Methods and systems for videoconferencing include recognition of emotions related to one videoconference participant, such as a customer. This ultimately enables another videoconference participant, such as a service provider or supervisor, to handle angry, annoyed, or distressed customers. One example method includes the steps of receiving a video that includes a sequence of images, detecting at least one object of interest (e.g., a face), locating feature reference points of the at least one object of interest, aligning a virtual face mesh to the at least one object of interest based on the feature reference points, finding over the sequence of images at least one deformation of the virtual face mesh that reflects face mimics, determining that the at least one deformation refers to a facial emotion selected from a plurality of reference facial emotions, and generating a communication bearing data associated with the facial emotion.
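The abstract's last two steps, matching an observed mesh deformation against a plurality of reference facial emotions, can be sketched as a nearest-reference lookup. The 4-value deformation vectors and the reference table below are invented for illustration; a real face mesh has hundreds of vertices and the patent does not disclose this particular matching rule.

```python
import math

# Hypothetical reference deformations, one per reference facial emotion.
# Each vector stands in for a full mesh deformation (e.g. brow lowering,
# cheek raise, lip-corner movement).
REFERENCE_DEFORMATIONS = {
    "neutral": [0.0, 0.0, 0.0, 0.0],
    "angry":   [0.6, -0.4, 0.5, 0.1],
    "happy":   [-0.2, 0.5, 0.0, 0.6],
}

def classify_deformation(deformation: list[float]) -> str:
    """Return the reference facial emotion whose deformation is closest
    (Euclidean distance) to the observed mesh deformation."""
    return min(
        REFERENCE_DEFORMATIONS,
        key=lambda label: math.dist(deformation, REFERENCE_DEFORMATIONS[label]),
    )
```

The chosen label would then populate the "communication bearing data associated with the facial emotion" from the abstract's final step.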
20 Claims
1. A method comprising:
receiving a video including a sequence of images and an audio stream from a video conference between a first user interface and a second user interface;
detecting a face of an individual in one or more of the images;
recognizing a speech emotion in the audio stream;
generating a communication bearing data associated with the speech emotion;
transmitting the communication bearing data over a communications network; and
switching the video conference from between the first user interface and the second user interface to between the first user interface and a third user interface responsive to the communication bearing data.
Dependent claims: 2-12.
13. A computing device comprising:
at least one processor; and
a memory storing processor-executable codes, which, when implemented by the at least one processor, cause the computing device to:
receive a video including a sequence of images and an audio stream from a video conference between a first user interface and a second user interface;
detect a face of an individual in one or more of the images;
recognize a speech emotion in the audio stream;
generate a communication bearing data associated with the speech emotion;
transmit the communication bearing data over a communications network; and
switch the video conference from between the first user interface and the second user interface to between the first user interface and a third user interface responsive to the communication bearing data.
Dependent claims: 14-20.
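The switching limitation shared by claims 1 and 13 amounts to a routing decision made from the communication bearing data. A minimal sketch follows; the endpoint names, emotion set, and 0.8 confidence threshold are assumptions for illustration, not values from the patent.

```python
# Hypothetical: the "third user interface" could belong to a supervisor
# who takes over when a customer sounds angry, annoyed, or distressed.
NEGATIVE_EMOTIONS = {"angry", "annoyed", "distressed"}

def select_endpoint(
    emotion_data: dict,
    default: str = "second_user_interface",
    escalation: str = "third_user_interface",
    threshold: float = 0.8,
) -> str:
    """Pick which interface the first user stays connected to,
    responsive to the received communication bearing emotion data."""
    if (emotion_data.get("emotion") in NEGATIVE_EMOTIONS
            and emotion_data.get("confidence", 0.0) >= threshold):
        return escalation
    return default
```

A conferencing server would call this on each received emotion message and re-signal the call only when the returned endpoint changes.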