Emotional speech processing

US 10,127,927 B2
Filed: 06/18/2015
Issued: 11/13/2018
Est. Priority Date: 07/28/2014
Status: Active Grant

Abstract

A method for emotion or speaking style recognition and/or clustering comprises receiving one or more speech samples, generating a set of training data by extracting one or more acoustic features from every frame of the one or more speech samples, and generating a model from the set of training data, wherein the model identifies emotion or speaking style dependent information in the set of training data. The method may further comprise receiving one or more test speech samples, generating a set of test data by extracting one or more acoustic features from every frame of the one or more test speech samples, transforming the set of test data using the model to better represent emotion/speaking style dependent information, and using the transformed data for clustering and/or classification to discover speech with similar emotion or speaking style. It is emphasized that this abstract is provided to comply with the rules requiring an abstract that will allow a searcher or other reader to quickly ascertain the subject matter of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims.
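
As a rough illustration of the per-frame feature extraction step described in the abstract, the following Python sketch cuts a signal into overlapping frames and computes two simple features for each frame. The 25 ms frame length, the 10 ms hop, the choice of log energy and zero-crossing rate, and the function name `frame_features` are all assumptions made for this example; this record does not state which acoustic features or frame sizes the patent uses.

```python
import numpy as np

def frame_features(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Return one [log-energy, zero-crossing rate] row per frame (illustrative features)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, 2))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    feats = np.empty((n_frames, 2))
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len]
        # Log energy of the frame; the small constant avoids log(0) on silence.
        feats[i, 0] = np.log(np.sum(frame ** 2) + 1e-10)
        # Fraction of adjacent sample pairs whose sign differs.
        signs = np.sign(frame)
        feats[i, 1] = np.mean(signs[1:] != signs[:-1])
    return feats

# Synthetic one-second 220 Hz tone standing in for a speech sample.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 220 * t)
features = frame_features(signal, sample_rate)
print(features.shape)
```

Each row of `features` corresponds to one frame, so a whole utterance becomes a matrix with one row per frame that later modelling stages can consume.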

70 Citations

29 Claims

1. A method, comprising:
- receiving one or more speech samples, wherein the one or more speech samples are characterized by one or more emotions or speaking styles from one or more speakers;
  
  generating a set of training data by extracting one or more acoustic features from every frame of the one or more sample speeches; and
  
  generating a model from the set of training data, wherein the model identifies emotion or speaking style dependent information in the set of training data, wherein the model includes the application of a Probabilistic Linear Discriminant Analysis (PLDA) to identify an emotion related subspace.
- - 2. The method of claim 1, wherein generating a model includes generating a PLDA model represented by PLDA parameters.
  - 3. The method of claim 2, further comprising:
    - receiving one or more test speech samples;
      
      generating a set of test data by extracting one or more acoustic features from every frame of the one or more test speech samples; and
      
      transforming the set of test data into transformed data using the PLDA model to capture emotion and/or speaking style in the transformed data; and
      
      using the transformed data for clustering and/or classification to discover speech with emotion or speaking styles similar to that captured in the transformed data.
  - 4. The method of claim 3, wherein the one or more test speeches includes one or more speakers and one or more emotions or speaking styles different from the one or more sample speeches.
  - 5. The method of claim 3, wherein transforming the set of test data includes transforming the set of test data into dimension reduced GMM supervectors using PLDA.
  - 6. The method of claim 3, wherein using the transformed data for clustering and/or classification to discover speech with similar emotion or speaking styles that captured in the transformed data includes:
    - adapting one or more neutral/read speech trained models to a specific emotion/emotions using the transformed data; and
      
      performing speech recognition using the one or more adapted models.
  - 7. The method of claim 3, wherein using the transformed data for clustering and/or classification to discover speech with similar emotion or speaking styles to that captured in the transformed data includes:
    - training one or more emotional speech models from scratch using the transformed data; and
      
      performing speech recognition using the one or more trained emotional models.
  - 8. The method of claim 3, wherein generating a set of test data includes model adaptation for modelling the extracted acoustic features as a Gaussian Mixture Model (GMM) and representing the set of test data with GMM mean supervectors.
  - 9. The method of claim 3, further comprising augmenting the classification and/or clustering with supplemental emotion classification using emotion recognition done in parallel by one or more methods other than analysis of speech samples.
  - 10. The method of claim 1, wherein generating a set of training data includes model adaptation for modelling the extracted acoustic features as a Gaussian Mixture Model (GMM) and representing the set of training data with GMM mean supervectors.
  - 11. The method of claim 1, wherein the one or more sample speeches are captured by a local microphone.
  - 12. The method of claim 1, wherein the one or more speeches are received over a network or from a local storage device.
  - 13. The method of claim 1, further comprising saving or transmitting the model, or applying the model to test data to characterize a speaking style or emotion in the test data.
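
Claims 8 and 10 refer to model adaptation that represents data as GMM mean supervectors. One common way to obtain such a supervector, sketched below in Python, is relevance-MAP adaptation of the means of a background GMM followed by stacking the adapted means into one vector. This is an assumption about how the step could be done, not a statement of the patent's procedure: the diagonal-covariance GMM, the relevance factor of 16, and the function name `map_adapt_supervector` are all hypothetical choices for the example.

```python
import numpy as np

def map_adapt_supervector(frames, weights, means, variances, relevance=16.0):
    """frames: (T, D); weights: (K,); means, variances: (K, D) of a diagonal GMM."""
    # Log density of every frame under every component, shape (T, K).
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    # Component posteriors per frame, normalized in a numerically stable way.
    log_post = np.log(weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    # Zeroth- and first-order statistics per component.
    n_k = post.sum(axis=0)
    data_mean = (post.T @ frames) / np.maximum(n_k, 1e-10)[:, None]
    # Interpolate between the data mean and the prior mean.
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted = alpha * data_mean + (1.0 - alpha) * means
    # Stack the K adapted means into a single K*D supervector.
    return adapted.reshape(-1)

# Toy background model with two well-separated components.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [10.0, 10.0]])
variances = np.ones((2, 2))
frames = np.ones((16, 2))  # 16 identical frames near the first component
supervector = map_adapt_supervector(frames, weights, means, variances)
print(supervector)
```

With 16 frames and a relevance factor of 16, the first component's mean moves halfway toward the data, while the second component, which sees almost no data, stays at its prior mean.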

14. A system, comprising:
- a processor module;
  
  a memory coupled to the processor, wherein the memory contains executable instructions configured to implement a method, the method comprising;
  
  receiving one or more speech samples;
  
  generating a set of training data by extracting one or more acoustic features from every frame of the one or more speech samples; and
  
  generating a model from the set of training data, wherein the model identifies emotion or speaking style dependent information in the set of training data, wherein the model includes the application of a Probabilistic Linear Discriminant Analysis (PLDA) to identify an emotion related subspace.
- - 15. The system of claim 14, wherein generating a model includes generating a PLDA model represented by PLDA parameters.
  - 16. The system of claim 14, generating a set of training data includes model adaptation for modelling the extracted acoustic features as a Gaussian Mixture Model (GMM) and representing the set of training data with GMM mean supervectors.
  - 17. The system of claim 14, the one or more sample speeches include a plurality of emotions or speaking styles with the one or more sample speeches from one or more persons for each emotion or speaking style.
  - 18. The system of claim 14, wherein the one or more speech samples are captured by a microphone.
  - 19. The system of claim 14, further comprising the microphone.
  - 20. The system of claim 14, wherein the one or more speech samples are received over a network or received from a local storage device.
  - 21. The system of claim 14, further comprising a network interface or the local storage device.
  - 22. The system of claim 14, wherein the method further comprises:
    - receiving one or more test speech samples;
      
      generating a set of test data by extracting one or more acoustic features from every frame of the one or more test speech samples;
      
      transforming the set of test data into transformed data using the PLDA model to capture emotion and/or speaking style in the transformed data; and
      
      using the transformed data for clustering and/or classification to discover speech with emotion or speaking styles similar to that captured in the transformed data.
  - 23. The system of claim 22, wherein the one or more test speech samples includes one or more speakers and one or more emotions or speaking styles different from one or more speech samples in the training data.
  - 24. The system of claim 22, wherein transforming the set of test data includes transforming the set of test data into dimension reduced GMM supervectors using Probabilistic Linear Discriminant Analysis (PLDA).
  - 25. The system of claim 22, wherein using the transformed data for clustering and/or classification to discover speech with similar emotion or speaking styles that captured in the transformed data includes:
    - adapting one or more neutral/read speech trained models to a specific emotion/emotions using the transformed data; and
      
      performing speech recognition using the one or more adapted models.
  - 26. The system of claim 22, wherein using the transformed data for clustering and/or classification to discover speech with similar emotion or speaking styles that captured in the transformed data includes:
    - training one or more emotional speech models from scratch using the transformed data; and
      
      performing speech recognition using the one or more trained emotional models.
  - 27. The system of claim 22, wherein the method further comprises augmenting the classification and/or clustering with supplemental emotion classification using emotion recognition done in parallel by one or more methods other than analysis of speech samples.

28. A non-transitory computer readable medium having embodied therein computer readable instructions configured, to implement a method, the method comprising:
- receiving one or more speech samples, wherein the one or more speech samples are characterized by one or more emotions or speaking styles from one or more speakers;
  
  generating a set of training data by extracting one or more acoustic features from every frame of the one or more sample speeches; and
  
  generating a model from the set of training data, wherein the model identifies emotion or speaking style dependent information in the set of training data, wherein the model includes the application of a Probabilistic Linear Discriminant Analysis (PLDA) to identify an emotion related subspace.
- - 29. The non-transitory computer readable medium of claim 28, wherein the method further comprises:
    - receiving one or more test speech samples;
      
      generating a set of test data by extracting one or more acoustic features from every frame of the one or more test speech samples;
      
      transforming the set of test data into transformed data using a Probabilistic Linear Discriminant Analysis (PLDA) model to capture emotion and/or speaking style in the transformed data; and
      
      using the transformed data for clustering and/or classification to discover speech with emotion or speaking styles similar to that captured in the transformed data.
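
The claims describe transforming test data with a PLDA model into a reduced-dimension representation and then clustering or classifying it. The Python sketch below is a deliberately simplified stand-in and not PLDA itself: it uses an LDA-style projection, which finds directions where between-emotion scatter is large relative to within-emotion scatter, followed by nearest-centroid classification. A full PLDA would instead fit a probabilistic latent-variable model, typically with EM. The function names, the ridge term, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def fit_emotion_subspace(X, labels, n_dims, ridge=1e-6):
    """LDA-style stand-in for PLDA. X: (N, D) vectors; labels: (N,) emotion ids."""
    labels = np.asarray(labels)
    mu = X.mean(axis=0)
    dim = X.shape[1]
    Sw = np.zeros((dim, dim))  # within-emotion scatter
    Sb = np.zeros((dim, dim))  # between-emotion scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)
    Sw = Sw / len(X) + ridge * np.eye(dim)
    Sb = Sb / len(X)
    # Whiten the within-emotion scatter, then diagonalize the between-emotion scatter.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    evals, evecs = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    order = np.argsort(evals)[::-1][:n_dims]
    W = L_inv.T @ evecs[:, order]  # (D, n_dims) projection matrix
    return mu, W

def transform(X, mu, W):
    """Project vectors into the learned low-dimensional subspace."""
    return (X - mu) @ W

def nearest_centroid(train_Z, train_labels, test_Z):
    """Assign each test vector the label of the closest class centroid."""
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    centroids = np.stack([train_Z[train_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Synthetic data: dimension 0 carries the emotion, dimension 1 is high-variance nuisance.
rng = np.random.default_rng(0)
def make_set(n):
    y = np.repeat([0, 1], n)
    X = np.column_stack([rng.normal(6.0 * y, 1.0), rng.normal(0.0, 5.0, size=2 * n)])
    return X, y

X_train, y_train = make_set(100)
X_test, y_test = make_set(100)
mu, W = fit_emotion_subspace(X_train, y_train, n_dims=1)
pred = nearest_centroid(transform(X_train, mu, W), y_train, transform(X_test, mu, W))
accuracy = float(np.mean(pred == y_test))
print(accuracy)
```

In this toy setup the projection discards the high-variance nuisance dimension, which is the role the claims assign to the emotion related subspace; in practice the inputs would be supervectors rather than two-dimensional points.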

Current Assignee: Sony Interactive Entertainment Inc. (Sony Group Corp.)
Original Assignee: Sony Interactive Entertainment Inc. (Sony Group Corp.)
Inventors: Kalinli-Akbacak, Ozlem; Chen, Ruxin
Primary Examiner(s): RodriguezGonzalez, Lennin
Application Number: US14/743,673
Publication Number: US 20160027452A1
Time in Patent Office: 1,244 Days
Field of Search: None
US Class Current
CPC Class Codes:
G10L 15/063   Training
G10L 15/07   to the speaker
G10L 17/26   Recognition of special voic...
G10L 25/63   for estimating an emotional...
