Speech signal separation and synthesis based on auditory scene analysis and speech modeling

US 9,536,540 B2
Filed: 07/18/2014
Issued: 01/03/2017
Est. Priority Date: 07/19/2013
Status: Active Grant

Assignments: 5
Petitions: 0

Abstract

Provided are systems and methods for generating clean speech from a speech signal representing a mixture of noise and speech. The clean speech may be generated from synthetic speech parameters. The synthetic speech parameters are derived from the speech signal components and a model of speech, using auditory and speech production principles. The modeling may utilize a source-filter structure of the speech signal. One or more spectral analyses are performed on the speech signal to generate spectral representations. Feature data is derived from a spectral representation. The features corresponding to the target speech according to a model of speech are grouped and separated from the feature data. The synthetic speech parameters, including spectral envelope, pitch data, and voice classification data, are generated based on the features corresponding to the target speech.
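The source-filter structure mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: it assumes a single frame, an impulse-train excitation for voiced frames, white noise for unvoiced frames, and an envelope sampled on real-FFT bins. The function name `synthesize_frame` and all of its parameters are hypothetical.

```python
import numpy as np

def synthesize_frame(envelope, pitch_hz, voiced, sample_rate=16000, rng=None):
    """Shape a source excitation with a spectral envelope (source-filter model).

    envelope: magnitude per real-FFT bin, length n_fft // 2 + 1 (assumed layout).
    pitch_hz: fundamental frequency, used only when `voiced` is True.
    """
    n_fft = 2 * (len(envelope) - 1)
    if voiced:
        # Voiced source (assumption): one unit pulse per pitch period.
        period = max(1, int(round(sample_rate / pitch_hz)))
        excitation = np.zeros(n_fft)
        excitation[::period] = 1.0
    else:
        # Unvoiced source (assumption): white Gaussian noise.
        if rng is None:
            rng = np.random.default_rng(0)
        excitation = rng.standard_normal(n_fft)
    # Filter: multiply the excitation spectrum by the envelope, bin by bin.
    spectrum = np.fft.rfft(excitation) * envelope
    return np.fft.irfft(spectrum, n=n_fft)
```

For example, `synthesize_frame(np.ones(257), 250.0, True)` returns a 512-sample frame whose energy sits at multiples of 250 Hz. A practical synthesizer would also overlap-add successive frames and smooth parameters across frame boundaries, which this sketch omits.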

20 Claims

1. A method for generating clean speech from a mixture of noise and speech, the method comprising:
- deriving speech parameters, based on the mixture of noise and speech and a model of speech, the deriving using at least one hardware processor, wherein the deriving speech parameters comprises:
  - performing one or more spectral analyses on the mixture of noise and speech to generate one or more spectral representations;
  - deriving, based on the one or more spectral representations, feature data;
  - grouping target speech features in the feature data according to the model of speech;
  - separating the target speech features from the feature data; and
  - generating, based at least partially on the target speech features, the speech parameters; and
- synthesizing, based at least partially on the speech parameters, clean speech.
- Dependent claims (2 to 10):
  - 2. The method of claim 1, wherein candidates for the target speech features are evaluated by a multi-hypothesis tracking system aided by the model of speech.
  - 3. The method of claim 1, wherein the speech parameters include spectral envelope and voicing information, the voicing information including pitch data and voice classification data.
  - 4. The method of claim 3, further comprising, prior to grouping the feature data, determining, based on a noise model, non-speech components in the feature data.
  - 5. The method of claim 4, wherein the pitch data are determined based, at least partially, on the non-speech components.
  - 6. The method of claim 4, wherein the pitch data are determined based at least on knowledge about where noise components occlude speech components.
  - 7. The method of claim 5, further comprising, while generating the speech parameters:
    - generating, based on the pitch data, a harmonic map, the harmonic map representing voiced speech; and
    - estimating, based on the non-speech components and the harmonic map, an unvoiced speech map.
  - 8. The method of claim 7, further comprising extracting a sparse spectral envelope from the one or more spectral representations using a mask, the mask being generated based on a harmonic map and an unvoiced speech map.
  - 9. The method of claim 8, further comprising estimating the spectral envelope based on a sparse spectral envelope.
  - 10. The method of claim 3, wherein the pitch data are interpolated to fill missing frames before synthesizing clean speech.
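Claim 10 recites interpolating pitch data to fill missing frames but does not name an interpolation scheme. The Python sketch below assumes linear interpolation over frame index, with missing frames marked as NaN. The helper name `fill_missing_pitch` and the NaN convention are hypothetical.

```python
import numpy as np

def fill_missing_pitch(pitch_hz):
    """Fill NaN entries of a per-frame pitch track by linear interpolation.

    Assumption: missing frames are NaN. Gaps at either end are held at the
    nearest known value, which is how np.interp treats out-of-range points.
    """
    pitch = np.asarray(pitch_hz, dtype=float)
    frames = np.arange(len(pitch))
    known = ~np.isnan(pitch)
    if not known.any():
        return pitch.copy()  # nothing to interpolate from
    return np.interp(frames, frames[known], pitch[known])
```

With the track `[100, nan, nan, 130, nan]` this yields `[100, 110, 120, 130, 130]`. In practice the fill would probably be restricted to gaps inside a single voiced segment, so that pitch is not interpolated across pauses or unvoiced stretches.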

11. A system for generating clean speech from a mixture of noise and speech, the system comprising:
- one or more processors; and
- a memory communicatively coupled with the processor, the memory storing instructions which if executed by the one or more processors perform a method comprising:
  - deriving speech parameters, based on the mixture of noise and speech and a model of speech, wherein the deriving speech parameters comprises:
    - performing one or more spectral analyses on the mixture of noise and speech to generate one or more spectral representations;
    - deriving, based on the one or more spectral representations, feature data;
    - grouping target speech features in the feature data according to the model of speech;
    - separating the target speech features from the feature data; and
    - generating, based at least partially on the target speech features, the speech parameters; and
  - synthesizing, based at least partially on the speech parameters, clean speech.
- Dependent claims (12 to 19):
  - 12. The system of claim 11, wherein candidates for the target speech features are evaluated by a multi-hypothesis tracking system aided by the model of speech.
  - 13. The system of claim 11, wherein the speech parameters include a spectral envelope and voicing information, the voicing information including pitch data and voice classification data.
  - 14. The system of claim 13, further comprising, prior to grouping the feature data, determining, based on a noise model, non-speech components in the feature data.
  - 15. The system of claim 14, wherein the pitch data are determined based partially on the non-speech components.
  - 16. The system of claim 14, wherein the pitch data are determined based at least on knowledge about where noise components occlude speech components.
  - 17. The system of claim 15, further comprising, while generating the speech parameters:
    - generating, based on the pitch data, a harmonic map, the harmonic map representing voiced speech; and
    - estimating, based on the non-speech components and the harmonic map, an unvoiced speech map.
  - 18. The system of claim 15, further comprising extracting a sparse spectral envelope from the one or more spectral representations using a mask, the mask being generated based on a harmonic map and an unvoiced speech map.
  - 19. The system of claim 18, further comprising estimating the spectral envelope based on the sparse spectral envelope.
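Claims 17 to 19 (and method claims 7 to 9) describe a chain from pitch data to a harmonic map, an unvoiced speech map, a mask, a sparse spectral envelope, and finally a full envelope. The claims do not specify how each step is computed, so the Python sketch below rests on stated assumptions: harmonics fall on the nearest FFT bin, unvoiced speech is any bin that is neither noise-dominated nor harmonic, and the full envelope is obtained by linear interpolation across frequency. All function names are hypothetical.

```python
import numpy as np

def harmonic_map(pitch_hz, n_bins, sample_rate, n_fft):
    """Boolean (frames x bins) map, True at the bin nearest each pitch harmonic."""
    hmap = np.zeros((len(pitch_hz), n_bins), dtype=bool)
    for t, f0 in enumerate(pitch_hz):
        if not np.isfinite(f0) or f0 <= 0:
            continue  # unvoiced or missing frame: no harmonics
        harmonics = np.arange(f0, sample_rate / 2, f0)
        bins = np.round(harmonics * n_fft / sample_rate).astype(int)
        hmap[t, bins[bins < n_bins]] = True
    return hmap

def unvoiced_map(noise_map, hmap):
    """Assumption: unvoiced speech is whatever is neither noise nor harmonic."""
    return ~noise_map & ~hmap

def sparse_envelope(magnitude, mask):
    """Keep magnitudes where the mask is True; mark the rest as unknown (NaN)."""
    return np.where(mask, magnitude, np.nan)

def fill_envelope(sparse):
    """Assumption: estimate the full envelope by interpolating across bins."""
    full = np.empty_like(sparse)
    bins = np.arange(sparse.shape[1])
    for t, row in enumerate(sparse):
        known = ~np.isnan(row)
        full[t] = np.interp(bins, bins[known], row[known]) if known.any() else 0.0
    return full
```

A typical call sequence would be `hmap = harmonic_map(...)`, `umap = unvoiced_map(noise_map, hmap)`, `sparse = sparse_envelope(magnitude, hmap | umap)`, then `fill_envelope(sparse)`. Here `noise_map` stands in for the non-speech components determined from a noise model, represented as a boolean array for simplicity.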

20. A non-transitory computer-readable storage medium having embodied thereon a program, the program being executable by a processor to perform a method for generating clean speech from a mixture of noise and speech, the method comprising:
- deriving speech parameters, based on the mixture of noise and speech and a model of speech, via instructions stored in the memory and executed by the one or more processors, wherein the deriving speech parameters comprises:
  - performing one or more spectral analyses on the mixture of noise and speech to generate one or more spectral representations;
  - deriving, based on the one or more spectral representations, feature data;
  - grouping target speech features in the feature data according to the model of speech;
  - separating the target speech features from the feature data; and
  - generating, based at least partially on the target speech features, the speech parameters; and
- synthesizing, based at least partially on the speech parameters, via instructions stored in the memory and executed by the one or more processors, clean speech.

Current Assignee: Samsung Electronics Co. Ltd.
Original Assignee: Knowles Electronics LLC (Knowles Corporation)
Inventors: Avendano, Carlos; Klein, David; Woodruff, John; Goodwin, Michael M.
Primary Examiner(s): Pham, Thierry L
Application Number: US14/335,850
Publication Number: US 20150025881A1
Time in Patent Office: 900 Days
Field of Search: 704/9, 704/200, 704/247, 704/251, 704/275
US Class Current: 1/1
CPC Class Codes:
- G10L 21/0208 Noise filtering
- G10L 21/0272 Voice signal separating
