Identification of the presence of speech in digital audio data
First Claim
1. A method for causing an audio data processing apparatus to determine speech related audio data within a recording of digital audio data based on transitions between voiced and unvoiced sequences, the method comprising:
extracting, in the audio data processing apparatus, audio features from the recording of digital audio data at an analyzing apparatus;
classifying, in the audio data processing apparatus, the recording of digital audio data based on the extracted audio features and with respect to one or more predetermined audio classes stored in an electronic memory of the apparatus;
marking, in the audio data processing apparatus, at least a part of the recording of digital audio data classified as speech, wherein the extraction of at least one audio feature includes partitioning the recording of digital audio data into adjoining frames;
defining, in the audio data processing apparatus and for each frame, a window being formed by a sequence of adjoining frames containing a frame under consideration;
determining, in the audio data processing apparatus, for the frame under consideration, and at least one next frame of the window, a spectral-emphasis-value which is related to a frequency distribution contained in the digital audio data of a respective frame and which represents a frequency at which a main audio energy is contained in the respective frame, the main audio energy indicating a major part of the audio energy in the respective frame, and classifying the frame under consideration as containing voiced or unvoiced audio data based on the spectral-emphasis-value of the frame under consideration; and
assigning, in the audio data processing apparatus, a presence-of-speech indicator value to the frame under consideration based on an evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window, said presence-of-speech indicator value being based on a detection of transitions between frames containing voiced and unvoiced audio data.
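The feature-extraction steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claim only requires that the spectral-emphasis-value represent the frequency at which the main audio energy of a frame is contained, so the energy-weighted mean frequency (spectral centroid) used here is an assumption, as are the function names and the 1500 Hz voiced/unvoiced threshold.

```python
import cmath
import math


def split_into_frames(samples, frame_len):
    """Partition samples into adjoining, non-overlapping frames.

    A trailing remainder shorter than frame_len is dropped (assumption).
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def spectral_emphasis(frame, sample_rate):
    """Return the energy-weighted mean frequency of one frame, in Hz.

    Uses a naive DFT so the sketch needs only the standard library.
    The spectral centroid is an assumed stand-in for the claimed
    spectral-emphasis-value.
    """
    n = len(frame)
    total = 0.0
    weighted = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        energy = abs(coeff) ** 2
        total += energy
        weighted += energy * (k * sample_rate / n)
    return weighted / total if total > 0 else 0.0


def is_voiced(emphasis_hz, threshold_hz=1500.0):
    """Label a frame as voiced when its energy sits at low frequencies.

    The threshold is hypothetical. Silent frames (value 0.0) are
    treated as not voiced.
    """
    return 0.0 < emphasis_hz < threshold_hz
```

Under these assumptions, a frame dominated by a low-frequency tone is labeled voiced, and a frame dominated by high-frequency energy (as in fricatives) is labeled unvoiced.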
Abstract
The present invention provides a method, a computer-software-product and an apparatus for enabling a determination of speech related audio data within a record of digital audio data. The method comprises steps for extracting audio features from the record of digital audio data, for classifying one or more subsections of the record of digital audio data, and for marking at least a part of the record of digital audio data classified as speech. The classification of the digital audio data record is performed on the basis of the extracted audio features and with respect to at least one predetermined audio class. The extraction of the at least one audio feature as used by a method according to the invention comprises steps for partitioning the record of digital audio data into adjoining frames, defining a window for each frame which is formed by a sequence of adjoining frames containing the frame under consideration, determining for the frame under consideration and at least one further frame of the window a spectral-emphasis-value which is related to the frequency distribution contained in the digital audio data of the respective frame, and assigning a presence-of-speech indicator value to the frame under consideration based on an evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and at least one further frame of the window.
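The last step of the abstract, assigning a presence-of-speech indicator from differences between spectral-emphasis-values inside a window, might look like the following sketch. The abstract does not say how the differences are evaluated, so counting adjacent frame pairs whose values jump by more than a fixed amount is an assumption. The names `transition_count` and `speech_indicators`, the 1000 Hz jump size, and the truncation of windows at the ends of the recording are also hypothetical.

```python
def transition_count(window_values, jump_hz=1000.0):
    """Count adjacent frame pairs whose emphasis values differ by more than jump_hz.

    A large jump is taken here as a sign of a voiced/unvoiced transition
    (assumed evaluation rule, not taken from the source).
    """
    return sum(1 for a, b in zip(window_values, window_values[1:])
               if abs(b - a) > jump_hz)


def speech_indicators(emphasis_values, half_width=2, jump_hz=1000.0):
    """Assign one indicator value per frame from its surrounding window.

    The window spans half_width frames on each side of the frame under
    consideration and is truncated at the ends of the recording.
    """
    indicators = []
    for i in range(len(emphasis_values)):
        window = emphasis_values[max(0, i - half_width): i + half_width + 1]
        indicators.append(transition_count(window, jump_hz))
    return indicators
```

With this rule, a sequence whose emphasis values alternate between low and high frequencies scores high, while a spectrally steady sequence scores zero.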
17 Claims
16. A non-transitory computer-readable medium having computer-readable instructions thereon, the instructions when executed by a computer cause the computer to perform a method for determining speech related audio data within a recording of digital audio data based on transitions between voiced and unvoiced sequences, comprising:
extracting audio features from the recording of digital audio data at an analyzing apparatus;
classifying the recording of digital audio data based on the extracted audio features and with respect to one or more predetermined audio classes stored in an electronic memory of the apparatus;
marking at least a part of the recording of digital audio data classified as speech, wherein the extraction of at least one audio feature includes partitioning the recording of digital audio data into adjoining frames;
defining, for each frame, a window formed by a sequence of an odd number of adjoining frames with the frame under consideration located in the middle of the sequence;
determining for the frame under consideration and at least one next frame of the window a spectral-emphasis-value which is related to a frequency distribution contained in the digital audio data of a respective frame and which represents a frequency at which a main audio energy is contained in the respective frame, the main audio energy indicating a major part of the audio energy in the respective frame, and classifying the frame under consideration as containing voiced or unvoiced audio data based on the spectral-emphasis-value of the frame under consideration; and
assigning a presence-of-speech indicator value to the frame under consideration based on an evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window, said presence-of-speech indicator value being based on a detection of transitions between frames containing voiced and unvoiced audio data.
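Claim 16 narrows the window to an odd number of adjoining frames with the frame under consideration in the middle. A small sketch of that window definition follows. The function name `centered_window` is hypothetical, and the claim does not say what happens near the start or end of a recording, so truncating the window there is an assumption.

```python
def centered_window(center, n_frames, width):
    """Return the frame indices of a window of `width` frames centered on `center`.

    width must be odd so that the frame under consideration sits exactly
    in the middle. Near the ends of the recording the window is cut
    short (assumption), so the frame is no longer centered there.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("window width must be a positive odd number")
    if not 0 <= center < n_frames:
        raise IndexError("center frame is outside the recording")
    half = width // 2
    return range(max(0, center - half), min(n_frames, center + half + 1))
```

For example, `list(centered_window(5, 10, 5))` yields the indices 3 to 7, with frame 5 in the middle.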
17. An audio data processing apparatus for determining speech related audio data within a recording of digital audio data based on transitions between voiced and unvoiced sequences, comprising:
an extraction unit configured to extract audio features from a recording of digital audio data, including
a defining unit configured to define, for each frame, a window formed by a sequence of an odd number of adjoining frames with the frame under consideration located in the middle of the sequence,
a determining unit configured to determine for the frame under consideration and at least one next frame of the window a spectral-emphasis-value which is related to a frequency distribution contained in the digital audio data of a respective frame and which represents a frequency at which a main audio energy is contained in the respective frame, the main audio energy indicating a major part of the audio energy in the respective frame, and classifying the frame under consideration as containing voiced or unvoiced audio data based on the spectral-emphasis-value of the frame under consideration, and
an assigning unit configured to assign a presence-of-speech indicator value to the frame under consideration based on an evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window, said presence-of-speech indicator value being based on a detection of transitions between frames containing voiced and unvoiced audio data;
a classification unit configured to classify the recording of digital audio data based on the extracted audio features and with respect to one or more predetermined audio classes stored in an electronic memory of the classification unit; and
a marking unit configured to mark at least a part of the recording of digital audio data classified as speech, wherein the extraction of at least one audio feature includes partitioning the recording of digital audio data into adjoining frames.
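The marking unit of claim 17 marks the parts of the recording classified as speech. One way to picture its output is a list of time segments, as in the sketch below. The claim classifies against predetermined audio classes, so the simple per-frame threshold on the presence-of-speech indicator used here is a stand-in assumption. The name `mark_speech_segments` and its parameters are likewise hypothetical.

```python
def mark_speech_segments(indicators, frame_seconds, min_indicator=2):
    """Merge consecutive speech-like frames into (start, end) segments in seconds.

    A frame counts as speech when its indicator reaches min_indicator
    (assumed stand-in for the claimed classification step).
    """
    segments = []
    start = None  # index of the first frame of the open segment, if any
    for i, value in enumerate(indicators):
        if value >= min_indicator:
            if start is None:
                start = i
        elif start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:  # recording ends inside a speech segment
        segments.append((start * frame_seconds, len(indicators) * frame_seconds))
    return segments
```

Returning time ranges rather than per-frame flags is a design choice of this sketch: it lets later stages cut or label the recording without knowing the frame length.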
Specification