Method and system for the automatic segmentation of an audio stream into semantic or syntactic units

US 20020010575A1
Filed: 08/02/2001
Published: 01/24/2002
Est. Priority Date: 04/08/2000
Status: Active Grant

Abstract

A digitized speech signal (600) is input to an F0 (fundamental frequency) processor that computes (610) continuous F0 data from the speech signal. Using voicing state transitions (voiced/unvoiced transitions) as the criterion, the speech signal is presegmented (620) into segments. For each segment (630), it is evaluated (640) whether F0 is defined or not defined, i.e., whether F0 is ON or OFF. If F0 = OFF, a candidate segment boundary is assumed and, starting from that boundary, prosodic features are computed (650). The feature values are input into a classification tree, and each candidate segment is classified, thereby revealing the existence or non-existence of a semantic or syntactic speech unit.
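The presegmentation step in the abstract can be illustrated with a minimal sketch. It assumes a per-frame voicedness score is already available from some F0 estimator (the record does not say how it is computed), and the function names `f0_index`, `presegment`, and `candidate_boundaries` are hypothetical. Frames whose voicedness equals the threshold are treated as OFF here, which is an assumption because the claims only cover the "lower" and "higher" cases.

```python
from typing import List, Tuple


def f0_index(voicedness: List[float], threshold: float) -> List[int]:
    """Frame-wise index: 1 where F0 is ON (voicedness above threshold), else 0.

    Assumption: a frame exactly at the threshold counts as OFF.
    """
    return [1 if v > threshold else 0 for v in voicedness]


def presegment(index: List[int]) -> List[Tuple[int, int, int]]:
    """Split the index into runs at each voiced/unvoiced transition.

    Returns (start_frame, end_frame_exclusive, state) tuples.
    """
    segments = []
    start = 0
    for i in range(1, len(index) + 1):
        # Close the current run at the end of the data or at a state change.
        if i == len(index) or index[i] != index[start]:
            segments.append((start, i, index[start]))
            start = i
    return segments


def candidate_boundaries(segments: List[Tuple[int, int, int]]) -> List[Tuple[int, int]]:
    """Every F0 = OFF segment is assumed to be a candidate boundary."""
    return [(start, end) for start, end, state in segments if state == 0]


voicedness = [0.9, 0.8, 0.2, 0.1, 0.1, 0.7, 0.9, 0.3]
index = f0_index(voicedness, threshold=0.5)
print(candidate_boundaries(presegment(index)))  # [(2, 5), (7, 8)]
```

Each returned pair is a frame range in which F0 is undefined; these ranges are only candidates, and the later prosodic classification decides which of them are kept.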

14 Claims

1. A method for the segmentation of an audio stream into semantic or syntactic units wherein the audio stream is provided in a digitized format, comprising the steps of:
- determining a fundamental frequency for the digitized audio stream;
- detecting changes of the fundamental frequency in the audio stream;
- determining candidate boundaries for the semantic or syntactic units depending on the detected changes of the fundamental frequency;
- extracting at least one prosodic feature in the neighborhood of the candidate boundaries;
- determining boundaries for the semantic or syntactic units depending on the at least one prosodic feature.
- Dependent claims (2 to 10):
  - 2. The method according to claim 1, wherein providing a threshold value for the voicedness of the fundamental frequency estimates and determining whether the voicedness of fundamental frequency estimates is lower than the threshold value.
  - 3. The method according to claim 2, wherein defining an index function for the fundamental frequency having a value =0 if the voicedness of the fundamental frequency is lower than the threshold value and having a value =1 if the voicedness of the fundamental frequency is higher than the threshold value.
  - 4. The method according to claim 3, wherein extracting at least one prosodic feature in an environment of the audio stream where the value of the index function is equal 0.
  - 5. The method according to claim 4, wherein the environment is a time period between 500 and 4000 milliseconds.
  - 6. The method according to claim 1, wherein the at least one prosodic feature is represented by the fundamental frequency.
  - 7. The method according to claim 1, wherein the extracting step involves extracting at least two prosodic features and combining the at least two prosodic features.
  - 8. The method according to claim 1, further comprising first detecting speech and non-speech segments in the digitized audio stream and performing the steps of claim 1 thereafter only for detected speech segments.
  - 9. The method according to claim 8, wherein the detecting of speech and non-speech segments comprises utilizing the signal energy or signal energy changes, respectively, in the audio stream.
  - 10. The method according to claim 1, further comprising the step of performing a prosodic feature classification based on a predetermined classification tree.
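Claims 4 to 7 and 10 describe extracting prosodic features in an environment around an F0 = OFF region and classifying them with a predetermined tree. The sketch below is one possible reading, not the patented implementation: the 10 ms frame hop, the three features, the helper names, and every threshold in the hand-written tree are assumptions chosen for illustration. Only the 500 to 4000 millisecond range comes from claim 5.

```python
from statistics import mean
from typing import Dict, List, Optional

FRAME_MS = 10  # assumed hop between F0 frames, in milliseconds


def extract_features(f0: List[Optional[float]], start: int, end: int,
                     window_ms: int = 1000) -> Dict[str, float]:
    """Combine pause length and mean F0 on both sides of an OFF region.

    f0 holds one value per frame, with None where F0 is undefined.
    start and end (exclusive) delimit the candidate boundary in frames.
    """
    if not 500 <= window_ms <= 4000:
        raise ValueError("window_ms must lie between 500 and 4000 (claim 5)")
    n = window_ms // FRAME_MS
    before = [v for v in f0[max(0, start - n):start] if v is not None]
    after = [v for v in f0[end:end + n] if v is not None]
    return {
        "pause_ms": float((end - start) * FRAME_MS),
        "f0_before": mean(before) if before else 0.0,
        "f0_after": mean(after) if after else 0.0,
    }


def is_unit_boundary(features: Dict[str, float]) -> bool:
    """Hypothetical two-level classification tree with made-up thresholds."""
    if features["pause_ms"] >= 200:
        return True  # a long pause is accepted on its own
    if features["pause_ms"] >= 80:
        # A medium pause needs an upward F0 reset of at least 20 Hz.
        return features["f0_after"] - features["f0_before"] >= 20.0
    return False


# 500 ms at 120 Hz, a 300 ms unvoiced gap, then 500 ms at 150 Hz.
contour = [120.0] * 50 + [None] * 30 + [150.0] * 50
feats = extract_features(contour, start=50, end=80, window_ms=500)
print(feats["pause_ms"], is_unit_boundary(feats))  # 300.0 True
```

In practice a tree of this kind would be trained on labeled speech rather than written by hand; the fixed thresholds here only make the control flow visible.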

11. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for causing segmentation of an audio stream into semantic or syntactic units, wherein the audio stream is provided in a digitized format, the computer readable program code means in the article of manufacture comprising computer readable program code means for causing a computer to effect:
- determining a fundamental frequency for the digitized audio stream;
- detecting changes of the fundamental frequency in the audio stream;
- determining candidate boundaries for the semantic or syntactic units depending on the detected changes of the fundamental frequency;
- extracting at least one prosodic feature in the neighborhood of the candidate boundaries;
- determining boundaries for the semantic or syntactic units depending on the at least one prosodic feature.

12. A digital audio processing system for segmentation of a digitized audio stream into semantic or syntactic units comprising:
- means for determining a fundamental frequency for the digitized audio stream, means for detecting changes of the fundamental frequency in the audio stream, means for determining candidate boundaries for the semantic or syntactic units depending on the detected changes of the fundamental frequency, means for extracting at least one prosodic feature in the neighborhood of the candidate boundaries, and means for determining boundaries for the semantic or syntactic units depending on the at least one prosodic feature.
- Dependent claims (13, 14):
  - 13. An audio processing system according to claim 12, further comprising means for generating an index function for the voicedness of the fundamental frequency having a value =0 if the voicedness of the fundamental frequency is lower than a predetermined threshold value and having a value =1 if the voicedness of the fundamental frequency is higher than the threshold value.
  - 14. An audio processing system according to claim 12 or 13, further comprising means for detecting speech and non-speech segments in the digitized audio stream, particularly for detecting and analyzing the signal energy or signal energy changes, respectively, in the audio stream.
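Claims 8, 9, and 14 mention detecting speech and non-speech segments from the signal energy. A minimal sketch of one common way to do this, short-time energy with a fixed threshold, is shown below. The frame length, the threshold, and the function names are assumptions; the record does not specify how the energy is evaluated.

```python
from typing import List


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of consecutive non-overlapping frames.

    A trailing partial frame is dropped.
    """
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_mask(energies: List[float], threshold: float) -> List[bool]:
    """Mark a frame as speech when its energy exceeds the assumed threshold."""
    return [e > threshold for e in energies]


# Silence, a short burst, then silence again.
signal = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4
energies = frame_energies(signal, frame_len=4)
print(speech_mask(energies, threshold=0.01))  # [False, True, False]
```

Only frames marked as speech would then be passed to the F0 analysis, which matches the ordering described in claim 8.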

Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
International Business Machines Corporation
Inventors
Stenzel, Gerhard; Kriechbaum, Werner; Haase, Martin

Granted Patent

US 7,120,575 B2
US Class Current

704/205
CPC Class Codes

G10L 15/1807 using prosody or stress

G10L 25/87 Detection of discrete point...
