Speech enhancement with low-order non-negative matrix factorization

US 10,276,179 B2
Filed: 06/16/2017
Issued: 04/30/2019
Est. Priority Date: 03/06/2017
Status: Active Grant

Abstract

A system is provided that employs a statistical approach to semi-supervised speech enhancement with a low-order non-negative matrix factorization (“NMF”). The system enhances noisy speech based on multiple dictionaries with dictionary atoms derived from the same clean speech samples and generates an enhanced speech representation of the noisy speech by combining, for each dictionary, a clean speech representation of the noisy speech generated based on a NMF using the dictionary atoms of the dictionary. The system generates frequency-domain (“FD”) clean speech sample representations of the clean speech samples, for example, using a Fourier transform. To generate each dictionary, the system generates a dictionary-unique initialization of the dictionary atoms and the activations and performs a NMF of the FD clean speech samples.
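
The dictionary-generation step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the Lee-Seung multiplicative updates with a Euclidean cost, the fixed iteration count, the seeded random initialization, and the names `train_dictionary` and `train_dictionaries` are all assumptions, since the text above names neither an update rule nor an initialization strategy. The toy matrix `V_clean` stands in for the FD magnitude representation of clean speech samples.

```python
import numpy as np

def train_dictionary(V, n_atoms, seed, n_iter=200, eps=1e-9):
    """Factor non-negative V (freq x frames) as W @ H.

    Assumption: Lee-Seung multiplicative updates for the Euclidean cost.
    The seed gives each dictionary its own initialization.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_atoms)) + eps   # initial dictionary atoms
    H = rng.random((n_atoms, n_frames)) + eps  # initial activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def train_dictionaries(V, n_dicts, n_atoms):
    """Run several low-order NMFs on the same data, one seed per dictionary."""
    return [train_dictionary(V, n_atoms, seed=k)[0] for k in range(n_dicts)]

# Toy stand-in for clean-speech magnitude spectra (16 bins, 40 frames).
V_clean = np.abs(np.random.default_rng(0).normal(size=(16, 40)))
dictionaries = train_dictionaries(V_clean, n_dicts=4, n_atoms=3)
```

Each call factors the same data, so the dictionaries differ only because their starting points differ.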

22 Claims

1. A method performed by a computing device for enhancing speech, the method comprising:
- accessing multiple dictionaries of dictionary atoms, the dictionaries being generated from clean speech samples by performing a non-negative matrix factorization (“NMF”) of frequency-domain (“FD”) clean speech sample representations of the clean speech samples, each NMF having a unique initialization, wherein each of the multiple dictionaries comprises a reduced number of dictionary atoms to conserve processing power;
  
  receiving noisy speech;
  
  generating a FD noisy speech representation of the noisy speech;
  
  for each of the multiple dictionaries, generating a FD clean speech representation corresponding to the FD noisy speech representation by performing a NMF of the FD noisy speech representation based on the dictionary atoms of the dictionaries;
  
  generating an enhanced FD clean speech representation of the noisy speech by combining the FD clean speech representations generated using each dictionary with the reduced number of dictionary atoms, the combining includes averaging the FD clean speech representations; and
  
  converting the enhanced FD clean speech representation into clean speech that represents an enhancement of the noisy speech.
- - 2. The method of claim 1 wherein averaging the FD clean speech representations includes iteratively performing the following steps until each dictionary has been selected:
    - selecting a dictionary of the multiple dictionaries of dictionary atoms;
      
      obtaining a non-negative maximum a posteriori probability estimate of a time-frequency component; and
      
      generating a running total of the FD clean speech representations; and
      
      dividing the running total by a number of the multiple dictionaries to generate the enhanced FD clean speech representation.
  - 3. The method of claim 2 wherein the combining further includes fusing the non-negative maximum a posteriori probability of spectral components with phase information.
  - 4. The method of claim 3 further comprising generating a mean and variance based on the FD clean speech representations.
  - 5. The method of claim 1 further comprising determining a phase associated with the FD noisy speech representation and wherein the converting of the enhanced FD clean speech representation factors in the phase.
  - 6. The method of claim 1 further comprising generating the dictionaries by:
    - receiving clean speech samples;
      
      generating FD clean speech sample representations of the clean speech samples; and
      
      for each of the dictionaries, generating initial dictionary atoms and activations based on an initialization strategy; and
      
      performing a NMF starting with the initial dictionary atoms and activations and adjusting the dictionary atoms and activations until a convergence criterion to the FD clean speech sample representations is satisfied.
  - 7. The method of claim 1 wherein the performing of the NMF of the FD noisy speech representation based on the dictionary atoms of the dictionary includes:
    - generating initial activations based on an initialization strategy; and
      
      performing a NMF starting with the dictionary atoms and the initial activations and adjusting the activations until a convergence criterion to the FD noisy speech representations is satisfied.
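
The per-dictionary factorization of claim 7 and the running-total average of claim 2 can be sketched together. This is a hedged illustration under stated assumptions: the multiplicative update rule, the relative-improvement convergence test, and the function names `fit_activations` and `enhance_magnitude` are not taken from the claims, and the plain reconstruction `W @ H` is used here as a stand-in for the per-dictionary estimate (the maximum a posteriori estimate of claim 2 is not modeled). The random dictionaries are placeholders for trained ones.

```python
import numpy as np

def fit_activations(V, W, seed=0, max_iter=500, tol=1e-6, eps=1e-9):
    """Hold the dictionary atoms W fixed and adjust only the activations H."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps  # initial activations
    prev = np.inf
    for _ in range(max_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        err = np.linalg.norm(V - W @ H)
        # Assumed convergence criterion: relative improvement below tol.
        if prev - err < tol * max(err, eps):
            break
        prev = err
    return H

def enhance_magnitude(V_noisy, dictionaries):
    """Average the per-dictionary estimates using a running total."""
    running_total = np.zeros(V_noisy.shape)
    for W in dictionaries:            # select each dictionary in turn
        H = fit_activations(V_noisy, W)
        running_total += W @ H        # per-dictionary clean-speech estimate
    return running_total / len(dictionaries)

rng = np.random.default_rng(1)
dictionaries = [rng.random((16, 3)) for _ in range(4)]  # placeholder dictionaries
V_noisy = rng.random((16, 25))                          # placeholder noisy magnitudes
V_enhanced = enhance_magnitude(V_noisy, dictionaries)
```

Because every dictionary has only a few atoms, each inner factorization is small, and the average plays the role of the combining step.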

8. A computing system for enhancing speech, the computing system comprising:
- one or more computer-readable storage media storing computer-executable instructions that, when executed, cause the computing system to:
  
  access multiple dictionaries of dictionary atoms;
  
  receive a frequency-domain (“FD”) noisy speech representation of noisy speech;
  
  for each of the multiple dictionaries, generate a FD clean speech representation corresponding to the FD noisy speech representation by performing a non-negative matrix factorization (“NMF”) of the FD noisy speech representation based on the dictionary atoms of the dictionary, wherein each of the multiple dictionaries comprises a reduced number of dictionary atoms to conserve processing power; and
  
  generate an enhanced FD clean speech representation by combining the FD clean speech representations generated using each dictionary with the reduced number of dictionary atoms, the combining includes averaging the FD clean speech representations; and
  
  one or more processors for executing the computer-executable instructions stored in the one or more computer-readable storage media.
- - 9. The computing system of claim 8 wherein the computer-executable instructions include instructions that convert the enhanced FD clean speech representation into clean speech that represents an enhancement of the noisy speech.
  - 10. The computing system of claim 9 wherein the computer-executable instructions include instructions that generate the FD noisy speech representation of the noisy speech.
  - 11. The computing system of claim 10 wherein the computer-executable instructions include instructions that determine a phase associated with the FD noisy speech representation and wherein the instructions that convert the enhanced FD clean speech representation factor in the phase.
  - 12. The computing system of claim 8 wherein the computer-executable instructions include instructions that averaging the generated FD clean speech representations includes iteratively performing the following steps until each dictionary has been selected:
    - selecting a dictionary of the multiple dictionaries of dictionary atoms;
      
      obtaining a non-negative maximum a posteriori probability estimate of a time-frequency component; and
      
      generating a running total of the FD clean speech representations; and
      
      dividing the running total by a number of the multiple dictionaries to generate the enhanced FD clean speech representation.
  - 13. The computing system of claim 12 wherein the combining further includes fusing the non-negative maximum a posteriori probability of spectral components with phase information.
  - 14. The computing system of claim 13 wherein the computer-executable instructions include instructions that generate a mean and variance based on the FD clean speech representations.
  - 15. The computing system of claim 8 wherein the computer-executable instructions include instructions that generate the dictionaries by:
    - receiving clean speech samples;
      
      generating FD clean speech sample representations of the clean speech samples; and
      
      for each of the dictionaries, generating initial dictionary atoms and activations based on an initialization strategy; and
      
      performing a NMF starting with the initial dictionary atoms and activations and adjusting the dictionary atoms and activations until a convergence criterion with the generated FD clean speech sample representations is satisfied.
  - 16. The computing system of claim 8 wherein the computer-executable instructions that perform the NMF of the FD noisy speech representation based on the dictionary atoms of the dictionary includes instructions that:
    - generate initial activations for a speech portion and initializations and activations for a noisy portion of the noisy speech based on an initialization strategy; and
      
      perform a NMF starting with the dictionary atoms and the initial activations for the speech portion and initial atoms and activations for the noise portion and adjusting the activations for the speech portion and the atoms and activations for the noise portion until a convergence criterion to the FD noisy speech representation is satisfied.
  - 17. The computing system of claim 8 wherein the instructions to generate the FD clean speech representation are executed in parallel by the one or more processors.
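
The semi-supervised factorization of claim 16, where the speech atoms stay fixed while the noise atoms and all activations are adjusted, might look like the sketch below. The multiplicative updates, the fixed iteration count, the name `semi_supervised_nmf`, and the choice to take `W_speech @ H_speech` as the speech estimate are assumptions made for illustration; the claim does not specify them.

```python
import numpy as np

def semi_supervised_nmf(V, W_speech, n_noise_atoms, seed=0, n_iter=200, eps=1e-9):
    """Fit V ~ [W_speech, W_noise] @ [H_speech; H_noise].

    W_speech is held fixed; W_noise and both sets of activations are learned.
    Assumption: Lee-Seung multiplicative updates for the Euclidean cost.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    n_speech = W_speech.shape[1]
    W_noise = rng.random((n_freq, n_noise_atoms)) + eps          # initial noise atoms
    H = rng.random((n_speech + n_noise_atoms, n_frames)) + eps   # initial activations
    for _ in range(n_iter):
        W = np.hstack([W_speech, W_noise])       # copy; W_speech is never modified
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # update speech and noise activations
        H_noise = H[n_speech:]
        W_noise *= (V @ H_noise.T) / (W @ H @ H_noise.T + eps)  # update noise atoms only
    return H[:n_speech], W_noise, H[n_speech:]

rng = np.random.default_rng(2)
W_speech = rng.random((16, 3))      # placeholder for one trained speech dictionary
V_noisy = rng.random((16, 25))      # placeholder noisy magnitudes
W_before = W_speech.copy()
H_speech, W_noise, H_noise = semi_supervised_nmf(V_noisy, W_speech, n_noise_atoms=2)
V_speech = W_speech @ H_speech      # assumed speech-only estimate for this dictionary
```

The noise model is learned from the noisy input itself, which is what makes the scheme semi-supervised: only the speech side needs training data.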

18. A method performed by a computing device for enhancing speech, the method comprising:
- receiving noisy speech;
  
  generating a FD noisy speech representation of the noisy speech;
  
  for each of multiple dictionaries, generating a FD clean speech representation corresponding to the FD noisy speech representation by performing a NMF of the FD noisy speech representation based on dictionary atoms of the dictionary, wherein each dictionary represents a different NMF based on the same clean speech samples, and wherein each of the multiple dictionaries comprises a reduced number of dictionary atoms to conserve processing power;
  
  generating an enhanced FD clean speech representation of the noisy speech by combining the generated FD clean speech representations generated using each dictionary with the reduced number of dictionary atoms, the combining includes averaging the FD clean speech representations by iteratively performing the following steps until each dictionary has been selected:
  
  selecting a dictionary of the multiple dictionaries of dictionary atoms;
  
  obtaining a non-negative maximum a posteriori probability estimate of a time-frequency component; and
  
  generating a running total of the FD clean speech representations; and
  
  dividing the running total by a number of the multiple dictionaries to generate the enhanced FD clean speech representation; and
  
  converting the enhanced FD clean speech representation into clean speech that represents an enhancement of the noisy speech.
- - 19. The method of claim 18 further comprising generating the dictionaries by:
    - receiving clean speech samples;
      
      generating FD clean speech sample representations of the clean speech samples; and
      
      for each of the dictionaries, generating initial dictionary atoms and activations; and
      
      performing a NMF starting with the initial dictionary atoms and activations and adjusting the dictionary atoms and activations until a convergence criterion is satisfied.
  - 20. The method of claim 18 wherein the performing of the NMF of the FD noisy speech representation based on the dictionary atoms of the dictionary includes:
    - generating initial activations; and
      
      performing a NMF starting with the dictionary atoms and the initial activations and adjusting the activations until a convergence criterion is satisfied.
  - 21. The method of claim 18 wherein the FD clean speech representations for the dictionaries are generated in parallel.
  - 22. The method of claim 21 wherein the generating of the FD clean speech representation for each of the multiple dictionaries are performed by a separate thread of execution for each dictionary.
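
Claims 21 and 22 describe generating the per-dictionary estimates in parallel, one thread of execution per dictionary. A rough sketch with Python's standard `concurrent.futures.ThreadPoolExecutor` follows. The function names, the all-ones initial activations, and the fixed iteration count are assumptions for illustration, and the random dictionaries are placeholders for trained ones.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def estimate_with_dictionary(V, W, n_iter=200, eps=1e-9):
    """Fit activations for a fixed dictionary W and return the estimate W @ H."""
    H = np.ones((W.shape[1], V.shape[1]))  # deterministic initial activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return W @ H

def enhance_parallel(V, dictionaries):
    """Run one worker thread per dictionary, then average the estimates."""
    with ThreadPoolExecutor(max_workers=len(dictionaries)) as pool:
        estimates = list(pool.map(lambda W: estimate_with_dictionary(V, W), dictionaries))
    return sum(estimates) / len(estimates)

rng = np.random.default_rng(3)
dictionaries = [rng.random((16, 3)) for _ in range(4)]  # placeholder dictionaries
V_noisy = rng.random((16, 25))                          # placeholder noisy magnitudes
V_enhanced = enhance_parallel(V_noisy, dictionaries)
```

The factorizations share no state, so they can run independently; how much wall-clock time the threads save depends on how much of the work the numerical library performs outside the interpreter lock.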

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Tashev, Ivan Jelev; Zarar, Shuayb M
Primary Examiner(s): Sirjani, Fariba
Application Number: US15/626,016
Publication Number: US 20180254050A1
Time in Patent Office: 683 Days
Field of Search: None
CPC Class Codes:
- G06F 40/242 Dictionaries
- G10L 21/0208 Noise filtering
- G10L 21/0232 Processing in the frequency...
- G10L 21/0364 for improving intelligibility
