Methods and systems for participant sourcing indication in multi-party conferencing and for audio source discrimination
Abstract
Indications of which participant is providing information during a multi-party conference. Each participant has equipment to display information being transferred during the conference. A sourcing signaler residing in the participant equipment provides a signal that indicates the identity of its participant when that participant is providing information to the conference. The source indicators of the other participant equipment receive the signal and cause a UI to indicate that the participant identified by the received signal is providing information (e.g., the UI can cause the identifier to change appearance). An audio discriminator is used to distinguish an acoustic signal generated by a person speaking from one generated in a band-limited manner. The audio discriminator analyzes the spectrum of detected audio signals and generates several parameters from the spectrum and from past determinations to determine the source of an audio signal on a frame-by-frame basis.
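The sourcing-signaler and source-indicator pairing described in the abstract can be sketched as follows. This is a minimal illustration; all class and method names are hypothetical, not the patent's implementation.

```python
# Hypothetical sketch of the abstract's sourcing signaler: one participant's
# equipment signals its identity while that participant is providing
# information, and each receiving equipment's UI marks that identity active.

class ParticipantUI:
    """Tracks which participant identifiers are shown as actively sourcing."""

    def __init__(self):
        self.active = set()

    def on_sourcing_signal(self, identity, sourcing):
        # Change the identifier's on-screen appearance when sourcing starts/stops.
        if sourcing:
            self.active.add(identity)
        else:
            self.active.discard(identity)


class SourcingSignaler:
    """Resides in one participant's equipment; signals the other equipment."""

    def __init__(self, identity, peer_uis):
        self.identity = identity
        self.peer_uis = peer_uis  # UIs of the other participants' equipment

    def set_sourcing(self, sourcing):
        for ui in self.peer_uis:
            ui.on_sourcing_signal(self.identity, sourcing)
```

In use, a participant's signaler broadcasts on speech onset and offset, and every peer UI updates the corresponding identifier's appearance.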
33 Citations
37 Claims
1. A method, comprising:
via a first participant equipment:
detecting an acoustic signal;
determining whether the detected acoustic signal was generated by a person speaking by receiving a frame of audio data derived from the detected acoustic signal;
classifying the received frame based on spectral data of the received frame, the spectral data obtained by performing a modulated complex lapped transform (MCLT) on the frame of audio data, the classifying comprising classifying the received frame as one of a plurality of predetermined frame types comprising a live-type frame, a phone-type frame, and an unsure-type frame, wherein live-type frames represent frames determined to be derived from acoustic signals generated by a person speaking, and phone-type frames represent frames determined to be derived from acoustic signals generated by an audio transducer device; and
providing a signal indicating to a second participant equipment that the detected acoustic signal was generated by the person. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21)
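The three-way live/phone/unsure classification in claim 1 can be sketched as a band-energy test. The claim obtains spectral data via an MCLT; this sketch instead takes a precomputed magnitude spectrum as input, and its split bin and ratio thresholds are illustrative assumptions rather than values from the patent. The intuition is that telephone audio is band-limited (roughly 300–3400 Hz), so phone-type frames carry little high-band energy.

```python
# Illustrative three-way frame classifier in the spirit of claim 1.
# Spectral transform (the claim's MCLT) is assumed done elsewhere; the
# input is a magnitude spectrum. Split bin and thresholds are placeholders.

LIVE, PHONE, UNSURE = "live", "phone", "unsure"

def classify_frame(spectrum, split_bin, live_ratio=0.2, phone_ratio=0.02):
    """Classify one frame from the ratio of high-band to low-band energy."""
    low = sum(m * m for m in spectrum[:split_bin])    # frame low band energy
    high = sum(m * m for m in spectrum[split_bin:])   # frame high band energy
    if low <= 0.0:
        return UNSURE                                 # no usable low-band energy
    ratio = high / low
    if ratio >= live_ratio:
        return LIVE      # wideband: consistent with a person speaking live
    if ratio <= phone_ratio:
        return PHONE     # band-limited: consistent with an audio transducer
    return UNSURE        # ambiguous energy distribution
```

A frame whose high/low energy ratio falls between the two thresholds is deliberately left unsure, matching the claim's third frame type.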
22. A computer-readable tangible medium having computer-executable instructions that, upon execution, facilitate a computing device in performing operations comprising:
detecting an acoustic signal;
determining whether the detected acoustic signal was generated by a person speaking by receiving a frame of audio data derived from the detected acoustic signal;
determining a source of the detected acoustic signal to be unsure, if:
a prior determination of the source of the detected acoustic signal is that the source of the detected acoustic signal was an audio transducer device;
the frame of audio data is classified as a live-type frame;
a predetermined number of most recent frames do not include enough live-type frames to exceed a predetermined live-type frame count threshold; and
an elapsed time since receiving a previous frame derived from speech exceeds a predetermined first time threshold;
or:
the prior determination of the source of the detected acoustic signal is that the source was the acoustic signal generated by a person speaking;
the frame of audio data is classified as a phone-type frame;
an elapsed time since receiving a previous live-type frame exceeds a predetermined second time threshold; and
a counter value is below a predetermined count threshold, the counter value to track a number of consecutive non-live-type frames received after receiving the live-type frame;
determining the source of the detected acoustic signal to be the acoustic signal generated by the person speaking, if:
the prior determination of a source of a detected acoustic signal is unsure;
the frame of audio data is classified as the live-type frame; and
the predetermined number of most recent prior frames includes live-type frames that exceed in number the predetermined live-type frame count threshold;
determining the source of the detected acoustic signal to be the audio transducer device, if:
the prior determination of the source of the detected acoustic signal is unsure;
the frame of audio data is classified as the phone-type frame; and
the predetermined number of most recent prior frames includes phone-type frames that exceed in number a predetermined phone-type frame count threshold;
classifying the received frame of audio data based on spectral data of the received frame of audio data, the classifying comprising classifying the received frame of audio data as one of a plurality of predetermined frame types comprising the live-type frame, the phone-type frame, and an unsure-type frame, wherein live-type frames represent frames determined to be derived from acoustic signals generated by the person speaking, and phone-type frames represent frames determined to be derived from acoustic signals generated by the audio transducer device, wherein parameters used to classify frames include high band noise floor energy, low band noise floor energy, frame high band energy, frame low band energy, and a ratio of the frame high band energy to the frame low band energy;
classifying the received frame of audio data from non-spectral data of the received frame based on parameters comprising an energy ratio threshold for live speech and an energy ratio for phone speech; and
providing a signal indicating to a second computing device that the detected acoustic signal was generated by a person. - View Dependent Claims (23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37)
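Claim 22's four conditions amount to a hysteresis state machine over per-frame classifications: the source determination only flips through the unsure state, gated by frame counts over a recent window, a consecutive non-live counter, and elapsed-time thresholds. A minimal sketch follows; all window sizes, count thresholds, and timings are illustrative placeholders, not the patent's values.

```python
# Sketch of claim 22's source-determination hysteresis. Every numeric
# parameter below is an illustrative placeholder.
from collections import deque
from enum import Enum

class FrameType(Enum):
    LIVE = "live"
    PHONE = "phone"
    UNSURE = "unsure"

class Source(Enum):
    PERSON = "person"
    TRANSDUCER = "transducer"
    UNSURE = "unsure"

class SourceStateMachine:
    """Applies claim 22's four transition conditions frame by frame."""

    def __init__(self, window=20, live_count=12, phone_count=12,
                 t1=2.0, t2=1.0, nonlive_limit=50):
        self.source = Source.UNSURE
        self.recent = deque(maxlen=window)   # predetermined number of most recent frames
        self.live_count = live_count         # live-type frame count threshold
        self.phone_count = phone_count       # phone-type frame count threshold
        self.t1 = t1                         # first time threshold (seconds)
        self.t2 = t2                         # second time threshold (seconds)
        self.nonlive_limit = nonlive_limit   # consecutive non-live count threshold
        self.last_live_time = None
        self.nonlive_run = 0                 # consecutive non-live frames since last live

    def update(self, frame, now):
        self.recent.append(frame)
        lives = sum(1 for f in self.recent if f is FrameType.LIVE)
        phones = sum(1 for f in self.recent if f is FrameType.PHONE)
        self.nonlive_run = 0 if frame is FrameType.LIVE else self.nonlive_run + 1
        since_live = (float("inf") if self.last_live_time is None
                      else now - self.last_live_time)

        if (self.source is Source.TRANSDUCER and frame is FrameType.LIVE
                and lives <= self.live_count and since_live > self.t1):
            self.source = Source.UNSURE       # transducer -> unsure
        elif (self.source is Source.PERSON and frame is FrameType.PHONE
                and since_live > self.t2 and self.nonlive_run < self.nonlive_limit):
            self.source = Source.UNSURE       # person -> unsure
        elif (self.source is Source.UNSURE and frame is FrameType.LIVE
                and lives > self.live_count):
            self.source = Source.PERSON       # unsure -> person
        elif (self.source is Source.UNSURE and frame is FrameType.PHONE
                and phones > self.phone_count):
            self.source = Source.TRANSDUCER   # unsure -> transducer

        if frame is FrameType.LIVE:
            self.last_live_time = now
        return self.source
```

Note the design the claim implies: a single contradictory frame never flips the determination directly from person to transducer or back; it first demotes the state to unsure, and only a sustained run of consistent frames in the recent window promotes it again.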
Specification