Identification of the presence of speech in digital audio data

US 8,036,884 B2
Filed: 02/24/2005
Issued: 10/11/2011
Est. Priority Date: 02/26/2004
Status: Expired due to Fees

Abstract

The present invention provides a method, a computer-software-product and an apparatus for enabling a determination of speech related audio data within a record of digital audio data. The method comprises steps for extracting audio features from the record of digital audio data, for classifying one or more subsections of the record of digital audio data, and for marking at least a part of the record of digital audio data classified as speech. The classification of the digital audio data record is performed on the basis of the extracted audio features and with respect to at least one predetermined audio class. The extraction of the at least one audio feature as used by a method according to the invention comprises steps for partitioning the record of digital audio data into adjoining frames, defining a window for each frame which is formed by a sequence of adjoining frames containing the frame under consideration, determining for the frame under consideration and at least one further frame of the window a spectral-emphasis-value which is related to the frequency distribution contained in the digital audio data of the respective frame, and assigning a presence-of-speech indicator value to the frame under consideration based on an evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and at least one further frame of the window.
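
The feature extraction summarized in the abstract can be illustrated with a minimal Python sketch. It assumes non-overlapping 20 ms frames and uses the textbook spectral centroid (magnitude-weighted mean frequency) as the spectral-emphasis-value. The patent's own operator definition is not reproduced on this page, so the formula, the function names `partition_into_frames`, `spectral_centroid` and `sine`, and the test signal are illustrative assumptions, not the claimed implementation.

```python
import cmath
import math


def partition_into_frames(samples, sample_rate, frame_ms=20):
    """Split samples into adjoining, non-overlapping frames of frame_ms each."""
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Drop the incomplete tail so that every frame has the same length.
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz (textbook definition, assumed)."""
    n = len(frame)
    weighted = 0.0
    total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        # Naive DFT coefficient for bin k; fine for short frames.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total > 0 else 0.0


def sine(freq_hz, sample_rate, n):
    """Hypothetical helper: n samples of a unit-amplitude sine tone."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


sample_rate = 8000
# 40 ms of a 200 Hz tone followed by 40 ms of a 3000 Hz tone.
samples = sine(200, sample_rate, 320) + sine(3000, sample_rate, 320)
frames = partition_into_frames(samples, sample_rate, frame_ms=20)
centroids = [spectral_centroid(f, sample_rate) for f in frames]
```

With this synthetic input the first two frames yield a centroid near 200 Hz and the last two a centroid near 3000 Hz, which is the kind of frame-to-frame jump that the later evaluation step looks for.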

17 Claims

1. A method for causing an audio data processing apparatus to determine speech related audio data within a recording of digital audio data based on transitions between voiced and unvoiced sequences, the method comprising:
- extracting, in the audio data processing apparatus, audio features from the recording of digital audio data at an analyzing apparatus;
  
  classifying, in the audio data processing apparatus, the recording of digital audio data based on the extracted audio features and with respect to one or more predetermined audio classes stored in an electronic memory of the apparatus;
  
  marking, in the audio data processing apparatus, at least a part of the recording of digital audio data classified as speech, wherein the extraction of at least one audio feature includes partitioning the recording of digital audio data into adjoining frames;
  
  defining, in the audio data processing apparatus and for each frame, a window being formed by a sequence of adjoining frames containing a frame under consideration;
  
  determining, in the audio data processing apparatus, for the frame under consideration, and at least one next frame of the window, a spectral-emphasis-value which is related to a frequency distribution contained in the digital audio data of a respective frame and which represents a frequency at which a main audio energy is contained in the respective frame, the main audio energy indicating a major part of the audio energy in the respective frame, and classifying the frame under consideration as containing voiced or unvoiced audio data based on the spectral-emphasis-value of the frame under consideration; and
  
  assigning, in the audio data processing apparatus, a presence-of-speech indicator value to the frame under consideration based on an evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window, said presence-of-speech indicator value being based on a detection of transitions between frames containing voiced and unvoiced audio data.
- Dependent claims (2 to 15):
  - 2. The method according to claim 1, wherein the extraction of the at least one audio feature is based on the recording of digital audio data providing the digital audio data in a time domain representation.
  - 3. The method according to claim 1, wherein the evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window is effected by determining the difference between the maximum spectral-emphasis-value and the minimum spectral-emphasis-value determined.
  - 4. The method according to claim 1, wherein the evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window is effected by forming the standard deviation of the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window.
  - 5. The method according to claim 1, wherein the spectral-emphasis-value of a frame is determined by applying the SpectralCentroid operator to the digital audio data forming the frame.
  - 6. The method according to claim 1, wherein the spectral-emphasis-value of a frame is determined by applying the AverageLSPP operator to the digital audio data forming the frame.
  - 7. The method according to claim 1, wherein the window defined for a frame under consideration is formed by a sequence of an odd number of adjoining frames with the frame under consideration being located in the middle of the sequence.
  - 8. The method according to claim 1, wherein a frame is formed by a subsection of the recording of digital audio data defining an interval, the interval corresponding to a time period between 10 ms and 30 ms.
  - 9. The method according to claim 1, wherein a spectral-emphasis-value is equally determined, in the audio data processing apparatus, for the frame under consideration and at least one preceding and at least one following frame.
  - 10. The method according to claim 5, wherein the SpectralCentroid operator is defined as
  - 11. The method according to claim 10, wherein detection of transitions between voiced and unvoiced sequences is based on a voiced/unvoiced transition detection function, which is defined by
  - 12. The method according to claim 11, wherein the spectral-emphasis-value of a frame is determined by applying the SpectralCentroid operator in addition to another audio feature to define a hybrid audio feature set
    HybridFeatureSet_{f_i} = [vud(f_i), MFCC′_{f_i}], where vud(f_j) is a voiced/unvoiced transition detection function.
  - 13. The method according to claim 6, wherein the AverageLSPP operator is defined as
  - 14. The method according to claim 13, wherein detection of transitions between voiced and unvoiced sequences is based on a voiced/unvoiced transition detection function, which is defined by
  - 15. The method according to claim 14, wherein the spectral-emphasis-value of a frame is determined by applying the AverageLSPP operator in addition to another audio feature to define a hybrid audio feature set
    HybridFeatureSet_{f_i} = [vud(f_i), MFCC′_{f_i}], where vud(f_j) is a voiced/unvoiced transition detection function.
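
Claims 3, 4 and 7 describe how the spectral-emphasis-values inside a window are evaluated. The following Python sketch shows one possible reading, working on a list of already computed values. The helper names, the window size of 5, the clipping of the window at the ends of the recording, and the use of the population standard deviation are assumptions made for illustration; the claims do not fix any of them.

```python
import statistics


def centered_window(values, index, size=5):
    """Odd-sized window with the frame under consideration in the middle.

    The window is clipped at the ends of the sequence (an assumption).
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    return values[max(0, index - half):index + half + 1]


def range_indicator(window):
    """Claim 3 style: maximum minus minimum spectral-emphasis-value."""
    return max(window) - min(window)


def stddev_indicator(window):
    """Claim 4 style: standard deviation (population form, assumed)."""
    return statistics.pstdev(window)


# Hypothetical per-frame spectral-emphasis-values in Hz.
emphasis = [200.0, 210.0, 2900.0, 3000.0, 220.0, 205.0, 215.0, 210.0, 200.0]
range_values = [range_indicator(centered_window(emphasis, i))
                for i in range(len(emphasis))]
stddev_values = [stddev_indicator(centered_window(emphasis, i))
                 for i in range(len(emphasis))]
```

Both indicators are large for frames whose window spans the jump between low and high values and small for frames in the steady tail of the sequence.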

16. A non-transitory computer-readable medium having computer-readable instructions thereon, the instructions when executed by a computer cause the computer to perform a method for determining speech related audio data within a recording of digital audio data based on transitions between voiced and unvoiced sequences, comprising:
- extracting audio features from the recording of digital audio data at an analyzing apparatus;
  
  classifying the recording of digital audio data based on the extracted audio features and with respect to one or more predetermined audio classes stored in an electronic memory of the apparatus;
  
  marking at least a part of the recording of digital audio data classified as speech, wherein the extraction of at least one audio feature includes partitioning the recording of digital audio data into adjoining frames;
  
  defining, for each frame, a window formed by a sequence of an odd number of adjoining frames with the frame under consideration located in the middle of the sequence;
  
  determining for the frame under consideration and at least one next frame of the window a spectral-emphasis-value which is related to a frequency distribution contained in the digital audio data of a respective frame and which represents a frequency at which a main audio energy is contained in the respective frame, the main audio energy indicating a major part of the audio energy in the respective frame, and classifying the frame under consideration as containing voiced or unvoiced audio data based on the spectral-emphasis-value of the frame under consideration; and
  
  assigning a presence-of-speech indicator value to the frame under consideration based on an evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window, said presence-of-speech indicator value being based on a detection of transitions between frames containing voiced and unvoiced audio data.
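
The last two steps of the claim, classifying each frame as voiced or unvoiced and basing the indicator on transitions between the two, can be sketched as follows in Python. The patent's voiced/unvoiced transition detection function is not reproduced on this page, so the fixed 1000 Hz threshold, the rule that a low spectral-emphasis-value means voiced, and the function names are hypothetical stand-ins.

```python
def classify_voiced(emphasis_hz, threshold_hz=1000.0):
    """Label a frame voiced when its main energy lies below threshold_hz.

    The threshold and the direction of the rule are assumptions.
    """
    return emphasis_hz < threshold_hz


def count_transitions(labels):
    """Number of voiced/unvoiced changes between neighbouring frames."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def transition_indicator(emphasis, index, size=5):
    """Transitions inside the odd-sized window centred on frame index."""
    half = size // 2
    window = emphasis[max(0, index - half):index + half + 1]
    labels = [classify_voiced(v) for v in window]
    return count_transitions(labels)


# Hypothetical spectral-emphasis-values in Hz: speech-like alternation
# versus a steady, music-like sequence.
speech_like = [200.0, 250.0, 3200.0, 3100.0, 300.0, 280.0, 3500.0, 260.0, 240.0]
steady = [500.0] * 9
speech_score = transition_indicator(speech_like, 4)
steady_score = transition_indicator(steady, 4)
```

In this toy input the window around frame 4 of the alternating sequence contains two transitions, while the steady sequence contains none.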

17. An audio data processing apparatus for determining speech related audio data within a recording of digital audio data based on transitions between voiced and unvoiced sequences, comprising:
- an extraction unit configured to extract audio features from a recording of digital audio data, including
  
  a defining unit configured to define, for each frame, a window formed by a sequence of an odd number of adjoining frames with the frame under consideration located in the middle of the sequence,
  
  a determining unit configured to determine for the frame under consideration and at least one next frame of the window a spectral-emphasis-value which is related to a frequency distribution contained in the digital audio data of a respective frame and which represents a frequency at which a main audio energy is contained in the respective frame, the main audio energy indicating a major part of the audio energy in the respective frame, and classifying the frame under consideration as containing voiced or unvoiced audio data based on the spectral-emphasis-value of the frame under consideration, and
  
  an assigning unit configured to assign a presence-of-speech indicator value to the frame under consideration based on an evaluation of the differences between the spectral-emphasis-values determined for the frame under consideration and the at least one next frame of the window, said presence-of-speech indicator value being based on a detection of transitions between frames containing voiced and unvoiced audio data;
  
  a classification unit configured to classify the recording of digital audio data based on the extracted audio features and with respect to one or more predetermined audio classes stored in an electronic memory of the classification unit; and
  
  a marking unit configured to mark at least a part of the recording of digital audio data classified as speech, wherein the extraction of at least one audio feature includes partitioning the recording of digital audio data into adjoining frames.
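
The division into an extraction unit, a classification unit and a marking unit can be mirrored in software. The Python sketch below is one hypothetical arrangement: the class `SpeechMarker`, the helper `merge_runs`, and the choice to report marked speech as (start, end) time intervals are all assumptions, since the claim does not say how the marking is represented.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


def merge_runs(flags: Sequence[bool], frame_seconds: float) -> List[Tuple[float, float]]:
    """Turn per-frame speech flags into (start, end) intervals in seconds."""
    segments = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i  # a run of speech frames begins here
        elif not flag and start is not None:
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:  # the recording ends inside a speech run
        segments.append((start * frame_seconds, len(flags) * frame_seconds))
    return segments


@dataclass
class SpeechMarker:
    # Extraction unit: samples -> one indicator value per frame.
    extract: Callable[[Sequence[float]], List[float]]
    # Classification unit: indicator value -> True if the frame is speech.
    classify: Callable[[float], bool]
    frame_seconds: float = 0.02

    def mark(self, samples: Sequence[float]) -> List[Tuple[float, float]]:
        """Marking unit: report the parts of the recording classified as speech."""
        indicators = self.extract(samples)
        flags = [self.classify(v) for v in indicators]
        return merge_runs(flags, self.frame_seconds)


# Toy usage: each input value stands in for one frame's indicator value.
marker = SpeechMarker(extract=lambda s: list(s),
                      classify=lambda v: v > 0.5,
                      frame_seconds=1.0)
segments = marker.mark([0.0, 1.0, 1.0, 0.0, 1.0])
```

Passing the units in as callables keeps the sketch independent of any particular feature or classifier, so a windowed spectral-emphasis evaluation could be supplied as `extract` without changing the marking logic.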


Current Assignee: Sony Deutschland GmbH (Sony Group Corp.)
Original Assignee: Sony Deutschland GmbH (Sony Group Corp.)
Inventors: Sola I Caros, Josep Maria; Lam, Yin Hay
Primary Examiner(s): Chawan, Vijay B.
Assistant Examiner(s): Colucci, Michael C.
Application Number: US11/065,555
Publication Number: US 20050192795A1
Time in Patent Office: 2,420 Days
Field of Search: 704/21, 704/208, 704/245, 704/253, 704/221, 704/205, 704/206, 704/211, 704/214, 704/216, 704/219, 704/222, 704/223, 704/227, 704/258, 704/266, 704/270, 704/500, 846/09, 846/22, 700/94
US Class Current: 704/205
CPC Class Codes:
G10H 2210/046 for differentiation between...
G10L 25/78 Detection of presence or ab...
