Estimating pitch using peak-to-peak distances

US 9,842,611 B2
Filed: 12/15/2015
Issued: 12/12/2017
Est. Priority Date: 02/06/2015
Status: Active Grant


Abstract

An estimate of the pitch of a signal may be computed using peak-to-peak distances in a frequency representation of the signal. A frequency representation of the signal may be computed, and peaks in the frequency representation may be identified, for example, by identifying peaks larger than a threshold value. Peak-to-peak distances may be determined using the locations in frequency of the peaks. The pitch of the signal may be estimated by, for example, estimating a cumulative distribution function of the peak-to-peak distances or computing a histogram of the peak-to-peak distances.
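
The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the naive DFT, the synthetic 200 Hz test frame, the relative threshold of 10% of the maximum, and the choice of the most populated histogram bin as the estimate are all assumptions made for illustration.

```python
import cmath
import math
from collections import Counter


def magnitude_spectrum(frame, sample_rate):
    """Return (frequencies_hz, magnitudes) for the non-negative DFT bins (naive DFT)."""
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags


def peak_locations(freqs, values, threshold):
    """Frequencies of local maxima whose value is larger than the threshold."""
    return [
        freqs[i]
        for i in range(1, len(values) - 1)
        if values[i] > threshold and values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


def peak_to_peak_distances(locations):
    """Differences in frequency between adjacent peaks."""
    return [b - a for a, b in zip(locations, locations[1:])]


def pitch_from_histogram(distances, bin_width_hz):
    """Assumed estimator: centre of the most populated histogram bin."""
    counts = Counter(round(d / bin_width_hz) for d in distances)
    best_bin, _ = counts.most_common(1)[0]
    return best_bin * bin_width_hz


# Synthetic test frame (assumption): five harmonics of 200 Hz, 50 ms at 8 kHz.
sample_rate = 8000
n = 400
f0 = 200.0
frame = [
    sum(math.sin(2 * math.pi * f0 * h * i / sample_rate) / h for h in range(1, 6))
    for i in range(n)
]

freqs, mags = magnitude_spectrum(frame, sample_rate)
threshold = 0.1 * max(mags)  # illustrative threshold, not taken from the patent
peaks = peak_locations(freqs, mags, threshold)
distances = peak_to_peak_distances(peaks)
estimate = pitch_from_histogram(distances, bin_width_hz=sample_rate / n)
print(estimate)  # 200.0 for this synthetic frame
```

Because the harmonics of a voiced signal sit at multiples of the pitch, adjacent spectral peaks are spaced one pitch apart, which is why the dominant peak-to-peak distance serves as the estimate here.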

18 Claims

1. A computer-implemented method for automatic speaker recognition, the method comprising:
- obtaining a first portion of a speech signal;
  
  computing, using one or more processing devices, a first frequency representation of the first portion of the speech signal;
  
  obtaining a first threshold;
  
  identifying a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold;
  
  computing, using the one or more processing devices, a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks;
  
  obtaining a second threshold;
  
  identifying a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold;
  
  computing, using the one or more processing devices, a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks;
  
  computing, using the one or more processing devices, a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances;
  
  obtaining a second portion of the speech signal;
  
  computing, using the one or more processing devices, a second frequency representation of the second portion of the speech signal;
  
  identifying a third plurality of peaks in the second frequency representation;
  
  computing, using the one or more processing devices, a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks;
  
  computing, using the one or more processing devices, a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances;
  
  generating, using the one or more processing devices, a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and
  
  applying the sequence of pitch estimates to recognize a speaker as a source of the speech signal.
- - 2. The method of claim 1, wherein computing the first pitch estimate of the first portion comprises estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.
  - 3. The method of claim 1, further comprising computing a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and wherein computing the first pitch estimate of the first portion of the speech signal comprises computing the first pitch estimate using the histogram.
  - 4. The method of claim 1, wherein the first frequency representation is computed using an estimated fractional chirp rate of the first portion of the speech signal.
  - 5. The method of claim 1, wherein computing the first frequency representation comprises using a first smoothing kernel.
  - 6. The method of claim 1, wherein the first frequency representation comprises a log likelihood ratio (LLR) spectrum.
  - 7. The method of claim 1, wherein the first frequency representation comprises a stationary spectrum.
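
The two-threshold step of claim 1 and the cumulative-distribution variant of claim 2 can be sketched as follows. This is a hedged illustration only: the helper names, the synthetic spectra, the threshold values, and the choice of the median of the empirical CDF as the pitch estimate are assumptions, since the claims do not state how the estimate is read off the distribution. The final speaker-recognition step is not modelled.

```python
def synthetic_spectrum(peak_heights, step_hz=10, max_hz=1000):
    """Build a toy frequency representation from a {frequency_hz: height} mapping."""
    freqs = list(range(0, max_hz + step_hz, step_hz))
    values = [peak_heights.get(f, 0.0) for f in freqs]
    return freqs, values


def peaks_above(freqs, values, threshold):
    """Locations of local maxima whose value is larger than the threshold."""
    return [
        freqs[i]
        for i in range(1, len(values) - 1)
        if values[i] > threshold and values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


def pooled_distances(freqs, values, thresholds):
    """Peak-to-peak distances collected over every threshold."""
    pooled = []
    for threshold in thresholds:
        locations = peaks_above(freqs, values, threshold)
        pooled.extend(b - a for a, b in zip(locations, locations[1:]))
    return pooled


def pitch_from_cdf(distances, quantile=0.5):
    """Assumed estimator: smallest distance at which the empirical CDF reaches the quantile."""
    ordered = sorted(distances)
    for rank, distance in enumerate(ordered, start=1):
        if rank / len(ordered) >= quantile:
            return distance


# First portion: the weak harmonic at 450 Hz is missed by the higher threshold.
freqs1, values1 = synthetic_spectrum({150: 1.0, 300: 0.8, 450: 0.3, 600: 0.6, 750: 0.5})
first_estimate = pitch_from_cdf(pooled_distances(freqs1, values1, thresholds=(0.45, 0.2)))

# Second portion: a single set of peaks, as in the third plurality of the claim.
freqs2, values2 = synthetic_spectrum({160: 1.0, 320: 0.7, 480: 0.6, 640: 0.5})
second_estimate = pitch_from_cdf(pooled_distances(freqs2, values2, thresholds=(0.4,)))

pitch_track = [first_estimate, second_estimate]
print(pitch_track)  # [150, 160]
```

In this toy case the higher threshold alone yields a spurious 300 Hz gap where the 450 Hz harmonic is skipped; pooling the distances from both thresholds leaves the 150 Hz spacing in the majority.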

8. A system for automatic speech recognition, the system comprising one or more computing devices comprising at least one processor and at least one memory, the one or more computing devices configured to:
- obtain a first portion of a speech signal;
  
  compute a first frequency representation of the first portion of the speech signal;
  
  obtain a first threshold;
  
  identify a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold;
  
  compute a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks;
  
  obtain a second threshold;
  
  identify a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold;
  
  compute a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks;
  
  compute a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances;
  
  obtain a second portion of the speech signal;
  
  compute a second frequency representation of the second portion of the speech signal;
  
  identify a third plurality of peaks in the second frequency representation;
  
  compute a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks;
  
  compute a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances;
  
  generate a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and
  
  apply the sequence of pitch estimates to perform automatic speech recognition on the speech signal.
- - 9. The system of claim 8, wherein the one or more computing devices are further configured to compute the first pitch estimate of the first portion by estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.
  - 10. The system of claim 8, wherein the one or more computing devices are further configured to compute a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and generate the first pitch estimate of the first portion of the speech signal using the histogram.
  - 11. The system of claim 8, wherein the one or more computing devices are further configured to compute the first frequency representation using a first smoothing kernel.
  - 12. The system of claim 8, wherein the first frequency representation comprises a log-likelihood ratio (LLR) spectrum.
  - 13. The system of claim 8, wherein the one or more computing devices are further configured to:
    - compute the first pitch estimate of the first portion of the speech signal by identifying a most frequently occurring peak-to-peak distance from the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.
  - 14. The system of claim 11, wherein the one or more computing devices are further configured to:
    - compute a third frequency representation of the first portion of the speech signal using a second smoothing kernel;
      
      identify a fourth plurality of peaks in the third frequency representation;
      
      compute a fourth plurality of peak-to-peak distances using locations in frequency of the fourth plurality of peaks; and
      
      compute a third pitch estimate of the first portion of the speech signal using the fourth plurality of peak-to-peak distances.
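
Claims 11, 13 and 14 can be read together as smoothing the same portion with two kernels and estimating the pitch from each result. The sketch below is one possible reading, not the claimed implementation: the triangular kernels, their widths, the toy spectrum with a spurious spike at 250 Hz, and the threshold of 30% of the smoothed maximum are all assumptions.

```python
from collections import Counter


def triangular_kernel(half_width):
    """Normalised triangular smoothing kernel with 2 * half_width + 1 taps."""
    weights = [half_width + 1 - abs(o) for o in range(-half_width, half_width + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, kernel):
    """Convolve values with the kernel, treating out-of-range samples as zero."""
    half = len(kernel) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = i + j - half
            if 0 <= k < len(values):
                acc += w * values[k]
        out.append(acc)
    return out


def peaks_above(freqs, values, threshold):
    """Locations of local maxima whose value is larger than the threshold."""
    return [
        freqs[i]
        for i in range(1, len(values) - 1)
        if values[i] > threshold and values[i] > values[i - 1] and values[i] > values[i + 1]
    ]


def most_frequent_distance(locations):
    """Most frequently occurring peak-to-peak distance (claim 13 style)."""
    distances = [b - a for a, b in zip(locations, locations[1:])]
    return Counter(distances).most_common(1)[0][0]


# Toy raw spectrum (assumption): harmonics of 100 Hz plus a spurious spike at 250 Hz.
freqs = list(range(0, 1010, 10))
raw = [0.0] * len(freqs)
for harmonic_hz in range(100, 1000, 100):
    raw[freqs.index(harmonic_hz)] = 1.0
raw[freqs.index(250)] = 0.5

estimates = []
for half_width in (1, 3):  # first and second smoothing kernels
    smoothed = smooth(raw, triangular_kernel(half_width))
    threshold = 0.3 * max(smoothed)  # illustrative relative threshold
    estimates.append(most_frequent_distance(peaks_above(freqs, smoothed, threshold)))

print(estimates)  # [100, 100]
```

The spike introduces two 50 Hz gaps under either kernel, but the 100 Hz spacing remains the most frequent distance, so both estimates agree in this example.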

15. One or more non-transitory computer-readable media comprising computer executable instructions that, when executed, cause at least one processor to perform actions comprising:
- obtaining a first portion of a speech signal;
  
  computing a first frequency representation of the first portion of the speech signal;
  
  obtaining a first threshold;
  
  identifying a first plurality of peaks in the first frequency representation using the first threshold by identifying values of the first frequency representation larger than the first threshold;
  
  computing a first plurality of peak-to-peak distances using locations in frequency of the first plurality of peaks;
  
  obtaining a second threshold;
  
  identifying a second plurality of peaks in the first frequency representation using the second threshold by identifying values of the first frequency representation larger than the second threshold;
  
  computing a second plurality of peak-to-peak distances using locations in frequency of the second plurality of peaks;
  
  computing a first pitch estimate of the first portion of the speech signal using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances;
  
  obtaining a second portion of the speech signal;
  
  computing a second frequency representation of the second portion of the speech signal;
  
  identifying a third plurality of peaks in the second frequency representation;
  
  computing a third plurality of peak-to-peak distances using locations in frequency of the third plurality of peaks;
  
  computing a second pitch estimate of the second portion of the speech signal using the third plurality of peak-to-peak distances;
  
  generating a sequence of pitch estimates, the sequence of pitch estimates comprising the first pitch estimate and the second pitch estimate; and
  
  applying the sequence of pitch estimates to recognize a speaker as a source of the speech signal.
- - 16. The one or more non-transitory computer-readable media of claim 15, wherein computing the first pitch estimate of the first portion comprises estimating a cumulative distribution function of the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances.
  - 17. The one or more non-transitory computer-readable media of claim 15, further comprising computing a histogram using the first plurality of peak-to-peak distances and the second plurality of peak-to-peak distances, and wherein computing the first pitch estimate of the first portion of the speech signal comprises computing the first pitch estimate using the histogram.
  - 18. The one or more non-transitory computer-readable media of claim 15, wherein the first frequency representation comprises a log-likelihood ratio (LLR) spectrum.

Current Assignee: Friday Harbor LLC
Original Assignee: Knuedge, Inc.
Inventors: Bradley, David C.; Morin, Yao Huang; Marongelli, Ellisha
Primary Examiner(s): AZAD, ABUL K
Application Number: US14/969,038
Publication Number: US 20160232925A1
Time in Patent Office: 728 Days
CPC Class Codes:
G10L 25/18 the extracted parameters be...
G10L 25/90 Pitch determination of spee...
