Pitch detection of speech signals

US 20050149321A1
Filed: 09/23/2004
Published: 07/07/2005
Est. Priority Date: 09/26/2003
Status: Active Grant


Abstract

Pitch detection of speech signals has numerous applications, including karaoke, voice recognition and scoring. While most existing techniques rely on time domain methods, the invention uses frequency domain methods. There is provided a method and system for determining the pitch of speech from a speech signal. The method includes the steps of: producing or obtaining the speech signal; distinguishing the speech signal into voiced, unvoiced or silence sections using speech signal energy levels; applying a Fourier Transform to the speech signal and obtaining speech signal parameters; determining peaks of the Fourier transformed speech signal; tracking the speech signal parameters of the determined peaks to select partials; and determining the pitch from the selected partials using a two-way mismatch error calculation.
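The energy-based sectioning step named in the abstract can be illustrated with a minimal sketch. The record does not give thresholds or frame sizes, so `frame_len`, `silence_thresh` and `voiced_thresh` below are illustrative assumptions, and the two-threshold rule is one plausible reading of "using speech signal energy levels", not the patented procedure.

```python
import math

def short_term_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)

def classify_frames(samples, frame_len=160, silence_thresh=1e-4, voiced_thresh=1e-2):
    """Label each non-overlapping frame 'silence', 'unvoiced' or 'voiced'.

    The thresholds are assumed values for illustration only.
    """
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        energy = short_term_energy(samples[start:start + frame_len])
        if energy < silence_thresh:
            labels.append("silence")
        elif energy < voiced_thresh:
            labels.append("unvoiced")
        else:
            labels.append("voiced")
    return labels

# Three synthetic 160-sample frames at an assumed 8 kHz sample rate.
sample_rate = 8000
quiet = [0.0] * 160
weak = [0.03 if i % 2 else -0.03 for i in range(160)]  # low energy
tone = [0.5 * math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(160)]
print(classify_frames(quiet + weak + tone))
```

In practice the thresholds would likely be tuned to the input level or derived adaptively; the sketch only shows how short-term energy can separate the three section types.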


41 Claims

1. A system for determining a pitch of speech from a speech signal, the system including:
- (1) an input device to receive the speech and generate the speech signal; and
  
  (2) a processor structured to:
  
  (a) distinguish the speech signal into voiced, unvoiced or silenced sections using speech signal energy levels;
  
  (b) apply a Fourier Transform to the voiced speech signal section and obtain speech signal parameters;
  
  (c) determine peaks of the Fourier transformed voiced speech signal section;
  
  (d) track the speech signal parameters of the determined peaks to select partials; and
  
  (e) determine the pitch from the selected partials using a two-way mismatch error calculation.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The system according to claim 1, wherein the speech signal is a coded, compressed or real-time audio or data signal.
  - 3. The system according to claim 1, adapted to perform real-time processing of live speech signals.
  - 4. The system according to claim 1, wherein the speech signal is a Pulse Code Modulated signal.
  - 5. The system according to claim 1, wherein the system is incorporated into a karaoke system, computer system or voice recognition system.
  - 6. The system according to claim 1, wherein the input device is a microphone or audio receiver.
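For the Pulse Code Modulated input of claim 4, a processor would first need numeric samples. The sketch below assumes signed 16-bit little-endian mono PCM, which the claims do not specify; the function name `pcm16le_to_float` is hypothetical.

```python
import struct

def pcm16le_to_float(data):
    """Decode signed 16-bit little-endian mono PCM bytes to floats in [-1.0, 1.0).

    The sample format is an assumption; the claims only say "Pulse Code Modulated".
    """
    count = len(data) // 2  # ignore a trailing odd byte, if any
    ints = struct.unpack("<%dh" % count, data[:2 * count])
    return [value / 32768.0 for value in ints]

# Three samples: 0, half scale, and negative full scale.
print(pcm16le_to_float(b"\x00\x00\x00\x40\x00\x80"))  # [0.0, 0.5, -1.0]
```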

7. A method of determining a pitch of speech from a speech signal, the method including the steps of:
- producing or obtaining the speech signal;
  
  distinguishing the speech signal into voiced, unvoiced or silenced sections using speech signal energy levels;
  
  applying a Fourier Transform to the voiced speech signal section and obtaining speech signal parameters;
  
  determining peaks of the Fourier transformed voiced speech signal section;
  
  tracking the speech signal parameters of the determined peaks to select partials; and
  
  determining the pitch from the selected partials using a two-way mismatch error calculation.
- Dependent claims (8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27):
  - 8. The method according to claim 7, wherein prior to applying the Fourier Transform a windowing procedure is applied to the voiced speech signal section.
  - 9. The method according to claim 8, wherein the windowing procedure utilizes a Blackman window, a Kaiser window, a Raised Cosine window or other sinusoidal models.
  - 10. The method according to claim 7, wherein applying the Fourier Transform comprises applying the Fourier Transform to a frame of the voiced speech signal section.
  - 11. The method according to claim 10, wherein the frame is one of a plurality of overlapping frames.
  - 12. The method according to claim 10, wherein the signal parameters form trajectories that are tracked over a selected number of frames of the voiced speech signal section.
  - 13. The method according to claim 12, wherein trajectories persisting over more than one frame of the selected number of frames are utilized.
  - 14. The method according to claim 7, wherein the Fourier Transform is a Fast Fourier Transform.
  - 15. The method according to claim 7, wherein the speech signal parameters are frequency, phase and amplitude.
  - 16. The method according to claim 7, wherein a zero padding procedure is used in determining the peaks of the Fourier transformed voiced speech signal section.
  - 17. The method according to claim 7, wherein a frequency of a determined peak falling within a specified frequency range of a frequency of a harmonic of the pitch is set equal to the frequency of the harmonic.
  - 18. The method according to claim 7, wherein the peaks are determined in an amplitude spectrum.
  - 19. The method according to claim 18, wherein the peaks are determined in the amplitude spectrum using a logarithmic scale.
  - 20. The method according to claim 7, wherein the partials are selected from the determined peaks based on a greatest common divisor of a maximum number of partials in a voiced speech signal section spectrum.
  - 21. The method according to claim 7, wherein the two-way mismatch error calculation compares a frequency of each selected partial to a frequency of a nearest predicted harmonic and a frequency of each predicted harmonic to a frequency of a nearest selected partial to provide a total error.
  - 22. The method according to claim 21, wherein the total error is normalized, and adjusted using a signal-to-noise ratio.
  - 23. The method according to claim 21, wherein the pitch is determined as corresponding to a minimum of the total error.
  - 24. The method according to claim 7, wherein the speech signal energy levels are short-term signal energy levels.
  - 25. The method according to claim 7, wherein distinguishing the speech signal further comprises utilizing an energy estimation calculation.
  - 26. The method according to claim 7, wherein determining the pitch further comprises determining the pitch using a localized frequency range.
  - 27. The method according to claim 26, wherein the localized frequency range is about 50-500 Hz.
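Claims 21, 23, 26 and 27 together describe a search: for each candidate pitch, compare partials to predicted harmonics in both directions, sum the errors, and take the minimum within roughly 50-500 Hz. The following is a simplified, frequency-only sketch of that idea. The relative-error weighting, the 1 Hz search step and the function names are assumptions, and the normalization and signal-to-noise adjustment of claim 22 are not modeled.

```python
def twm_error(partials, f0):
    """Frequency-only two-way mismatch error for a candidate pitch f0 (Hz)."""
    n_harm = max(1, int(round(max(partials) / f0)))
    harmonics = [f0 * k for k in range(1, n_harm + 1)]
    # Measured-to-predicted: each partial against its nearest predicted harmonic.
    err_mp = sum(min(abs(p - h) for h in harmonics) / p for p in partials)
    # Predicted-to-measured: each predicted harmonic against its nearest partial.
    err_pm = sum(min(abs(h - p) for p in partials) / h for h in harmonics)
    # Average each direction so the two terms are comparable (an assumption).
    return err_mp / len(partials) + err_pm / len(harmonics)

def estimate_pitch(partials, f_lo=50.0, f_hi=500.0, step=1.0):
    """Return the candidate in [f_lo, f_hi] with the smallest total error."""
    best_f0, best_err = None, float("inf")
    f0 = f_lo
    while f0 <= f_hi:
        err = twm_error(partials, f0)
        if err < best_err:
            best_f0, best_err = f0, err
        f0 += step
    return best_f0

# Partials of an ideal 200 Hz harmonic series.
print(estimate_pitch([200.0, 400.0, 600.0, 800.0]))  # 200.0
```

The two directions penalize different mistakes: the measured-to-predicted term rejects candidates that are too high (partials left unexplained), while the predicted-to-measured term rejects candidates that are too low (predicted harmonics with no nearby partial).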

28. A system for determining a pitch of speech from a speech signal, the system comprising:
- (1) an input device to receive the speech and generate the speech signal; and
  
  (2) a processor structured to:
  
  (a) distinguish the speech signal into voiced, unvoiced or silenced speech signal sections using speech signal energy levels;
  
  (b) apply a windowing procedure to the voiced speech signal section to generate a frame;
  
  (c) apply a Fourier Transform to the frame and obtain speech signal parameters;
  
  (d) determine peaks of the Fourier transformed frame;
  
  (e) track the speech signal parameters of the determined peaks to select partials; and
  
  (f) determine the pitch from the selected partials using a two-way mismatch error calculation.
- Dependent claims (29, 30, 31, 32, 33, 34, 35, 36, 37, 38):
  - 29. The system of claim 28, wherein the windowing procedure utilizes a Blackman window, a Kaiser window, a Raised Cosine window or other sinusoidal models.
  - 30. The system of claim 28, wherein the frame is one of a plurality of overlapping frames.
  - 31. The system of claim 28, wherein the signal parameters form trajectories that are tracked over a selected number of frames of the voiced speech signal section.
  - 32. The system of claim 31, wherein trajectories persisting over more than one frame of the selected number of frames are utilized.
  - 33. The system of claim 28, wherein the Fourier Transform is a Fast Fourier Transform.
  - 34. The system of claim 28, wherein the processor is further adapted to determine peaks of the Fourier transformed frame using a zero padding procedure.
  - 35. The system of claim 28, wherein the processor is further adapted to set a frequency of a determined peak falling within a specified frequency range of a frequency of a harmonic of the pitch equal to the frequency of the harmonic.
  - 36. The system of claim 28, wherein the processor is further configured to select partials from the determined peaks based on a greatest common divisor of a maximum number of partials in the Fourier transformed frame.
  - 37. The system of claim 28, wherein the two-way mismatch error calculation compares a frequency of each selected partial to a frequency of a nearest predicted harmonic and a frequency of each predicted harmonic to a frequency of a nearest selected partial to provide a total error.
  - 38. The system of claim 37, wherein the pitch is determined as corresponding to a minimum of the total error.
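Steps (b) to (d) of claim 28, with the options of claims 29, 33 and 34, amount to: window a frame, zero pad it, take a Fast Fourier Transform, and pick peaks. The sketch below uses a Blackman window and picks local maxima of the log-amplitude spectrum. The FFT size, the -40 dB floor and the function names are assumptions; the small recursive FFT is included only to keep the example free of third-party dependencies.

```python
import cmath
import math

def blackman(n):
    """Symmetric Blackman window of length n."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def spectral_peaks(frame, sample_rate, fft_size=1024, floor_db=-40.0):
    """Return (frequency_hz, level_db) for local maxima above floor_db.

    Levels are in dB relative to the strongest bin; floor_db is an assumed value.
    """
    window = blackman(len(frame))
    padded = [s * w for s, w in zip(frame, window)]
    padded += [0.0] * (fft_size - len(frame))  # zero padding refines the bin grid
    mags = [abs(c) for c in fft(padded)[:fft_size // 2]]
    ref = max(mags) or 1.0
    db = [20 * math.log10(max(m / ref, 1e-12)) for m in mags]
    peaks = []
    for k in range(1, len(db) - 1):
        if db[k] > floor_db and db[k] > db[k - 1] and db[k] >= db[k + 1]:
            peaks.append((k * sample_rate / fft_size, db[k]))
    return peaks

# A 256-sample frame with components at 200 Hz and 400 Hz (assumed 8 kHz rate).
sr = 8000
frame = [math.sin(2 * math.pi * 200 * i / sr)
         + 0.5 * math.sin(2 * math.pi * 400 * i / sr) for i in range(256)]
for freq, level in spectral_peaks(frame, sr):
    print(round(freq, 1), round(level, 1))
```

Zero padding does not add resolution; it only samples the same spectrum more densely, which helps locate each peak to within one padded bin (about 7.8 Hz here).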

39. A system for estimating a pitch of speech from a speech signal, the system including:
- (1) an input device to receive the speech and produce the speech signal;
  
  (2) a memory unit or storage unit adapted to communicate required data to a processing unit; and
  
  (3) the processing unit operating on the speech signal and structured to:
  
  (a) section the speech signal into voiced, unvoiced or silenced sections using speech signal energy levels;
  
  (b) apply a Fast Fourier Transform to the voiced speech signal section and generate speech signal parameters;
  
  (c) determine peaks of the Fourier transformed voiced speech signal section;
  
  (d) track the speech signal parameters of the determined peaks to select partials; and
  
  (e) calculate the pitch from the selected partials using a two-way mismatch error calculation.
- Dependent claims (40):
  - 40. The system as claimed in claim 39, wherein the Fast Fourier Transform operates on a frame of a windowed portion of the speech signal, and the speech signal parameters are tracked over more than one frame.
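Tracking parameters over more than one frame (claim 40, and the trajectories of claims 12, 13, 31 and 32) can be sketched as greedy nearest-frequency linking between consecutive frames, keeping only trajectories that persist. The linking rule, the `max_jump_hz` tolerance and the function name are assumptions; the claims do not say how trajectories are formed, and only frequency is tracked here although claim 15 also names phase and amplitude.

```python
def track_partials(frames_peaks, max_jump_hz=20.0, min_frames=2):
    """Link peak frequencies across consecutive frames into trajectories.

    frames_peaks: one list of peak frequencies (Hz) per frame.
    Returns trajectories, each a list of (frame_index, frequency), that
    last at least min_frames frames.
    """
    active, finished = [], []
    for t, peaks in enumerate(frames_peaks):
        unused = list(peaks)
        still_active = []
        for traj in active:
            last_f = traj[-1][1]
            best = min(unused, key=lambda f: abs(f - last_f), default=None)
            if best is not None and abs(best - last_f) <= max_jump_hz:
                traj.append((t, best))  # continue this trajectory
                unused.remove(best)
                still_active.append(traj)
            else:
                finished.append(traj)  # no close peak: trajectory ends
        # Every unmatched peak starts a new trajectory.
        still_active.extend([[(t, f)] for f in unused])
        active = still_active
    finished.extend(active)
    return [traj for traj in finished if len(traj) >= min_frames]

# Two stable partials plus two one-frame spurious peaks (1234 Hz and 900 Hz).
frames = [[200.0, 400.0, 1234.0], [202.0, 398.0], [205.0, 401.0, 900.0]]
for traj in track_partials(frames):
    print([f for _, f in traj])
```

Peaks that appear in only one frame are discarded, which is one way to keep spurious spectral peaks from being selected as partials.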

41. A system for determining a pitch of speech from a speech signal, comprising:
- means for producing or obtaining the speech signal;
  
  means for distinguishing the speech signal into voiced, unvoiced or silenced speech signal sections using speech signal energy levels;
  
  means for applying a Fourier Transform to the voiced speech signal section and obtaining speech signal parameters;
  
  means for determining peaks of the Fourier transformed voiced speech signal section;
  
  means for tracking the speech signal parameters of the determined peaks to select partials; and
  
  means for determining the pitch from the selected partials using a two-way mismatch error calculation.


Current Assignee
STMicroelectronics International N.V. (STMicroelectronics NV)
Original Assignee
STMicroelectronics Asia Pacific Pte Limited (STMicroelectronics NV)
Inventors
George, Sapna; Kabi, Prakash Padhi

Granted Patent

US 7,660,718 B2
US Class Current

704/207
CPC Class Codes

G10L 25/90 Pitch determination of spee...
