Energy-Efficient Unobtrusive Identification of a Speaker

US 20120303369A1
Filed: 05/26/2011
Published: 11/29/2012
Est. Priority Date: 05/26/2011
Status: Active Grant

Abstract

Functionality is described herein for recognizing speakers in an energy-efficient manner. The functionality employs a heterogeneous architecture that comprises at least a first processing unit and a second processing unit. The first processing unit handles a first set of audio processing tasks (associated with the detection of speech) while the second processing unit handles a second set of audio processing tasks (associated with the identification of speakers), where the first set of tasks consumes less power than the second set of tasks. The functionality also provides unobtrusive techniques for collecting audio segments for training purposes. The functionality also encompasses new applications which may be invoked in response to the recognition of speakers.

246 Citations

20 Claims

1. A method for identifying a speaker in an energy-efficient manner, comprising:
- at a first processing unit:
  - determining whether a received audio signal satisfies an audio strength test; and
  - if the audio strength test is satisfied, determining whether the audio signal contains human speech; and
- at a second processing unit:
  - identifying whether speech frames, obtained from the audio signal, have a quality which satisfies an admission test, the speech frames which satisfy the admission test comprising admitted frames; and
  - associating a speaker model with the admitted frames, the speaker model corresponding to a particular user,
- operations performed by the first processing unit having a first power expenditure and operations performed by the second processing unit having a second power expenditure, the first power expenditure being less than the second power expenditure.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14):
  - 2. The method of claim 1, wherein the operations performed by the first processing unit are performed on a more frequent basis than the operations performed by the second processing unit.
  - 3. The method of claim 1, wherein the first processing unit and the second processing unit are implemented by a portable communication device.
  - 4. The method of claim 1, wherein said determining of whether the audio signal satisfies the audio strength test comprises:
    - periodically determining whether the audio signal exceeds a prescribed strength threshold for a predetermined number of audio samples within a prescribed period of time; and
      
      if the audio signal satisfies the audio strength test, activating a speech detection module to perform said determining of whether the audio signal contains human speech.
  - 5. The method of claim 1, wherein said determining of whether the audio signal contains human speech comprises:
    - sampling the audio signal at a prescribed rate and dividing the audio signal into audio frames, a window comprising a plurality of consecutive audio frames;
      
      determining window features associated with each window;
      
      determining, using a classifier module, whether the window features indicate that the audio signal contains human speech for each window; and
      
      if the audio signal contains human speech, activating the second processing unit.
  - 6. The method of claim 1, wherein said identifying of whether the speech frames have a quality which satisfies the admission test comprises:
    - determining frame admission metrics for the speech frames;
      
      identifying, based on the frame admission metrics, the admitted frames; and
      
      activating a speaker identification module to perform said associating of the speaker model with the admitted frames.
  - 7. The method of claim 1, wherein said associating of the speaker model with the admitted frames comprises:
    - receiving the admitted frames;
      
      generating speaker identification features based on the admitted frames; and
      
      determining a speaker model which matches the speaker identification features over a smoothing window, the smoothing window being made up of a plurality of admitted frames.
  - 8. The method of claim 1, wherein said associating of the speaker model with the admitted frames comprises:
    - identifying a context associated with production of the audio signal; and
      
      selecting the speaker model from a subset of possible speaker models that are associated with the context.
  - 9. The method of claim 1, further comprising:
    - obtaining, from a local participant, an audio segment that captures a telephone conversation between the local participant and a remote participant;
      
      generating a speaker model for the local participant based on a part of the audio segment that is attributed to the local participant; and
      
      generating a speaker model for the remote participant based on a part of the audio segment that is attributed to the remote participant, wherein said generating of the speaker model for the remote participant employs one or more compensation techniques to address channel distortion in the part of the audio segment that is attributed to the remote participant.
  - 10. The method of claim 9, wherein the part of the audio segment that is attributed to the local participant is associated with a first channel and the part of the audio segment that is attributed to the remote participant is associated with a second channel.
  - 11. The method of claim 1, further comprising:
    - obtaining, from a first participant, an audio segment that captures a one-on-one conversation between a first participant and a second participant;
      
      identifying a part of the audio segment that is attributed to the first participant and a part of the audio segment that is attributed to the second participant using an audio segmentation algorithm; and
      
      generating a speaker model for the second participant based on the part of the audio segment that is attributed to the second participant.
  - 12. The method of claim 11, wherein the audio segmentation algorithm comprises using at least a speaker model for the first participant to identify the part of the audio segment that is attributed to the first participant.
  - 13. The method of claim 1, further comprising at least one of:
    - transferring a speaker model of a first participant to a second participant; and
      
      transferring a speaker model of the second participant to the first participant.
  - 14. The method of claim 1, further comprising performing an action in response to said associating of the speaker model with the admitted frames, the action comprising at least one of:
    - identifying a name of the particular user when the particular user speaks;
      
      associating life-logging data with the particular user;
      
      filtering social network data based on identification of the particular user; and
      
      generating a reminder that is associated with the particular user.
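
The checks recited in claims 4, 6, and 7 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the function names, the use of mean squared amplitude as the frame admission metric, and the majority vote used for smoothing are not taken from the claims, which leave those details open.

```python
from collections import Counter
from typing import List, Sequence


def passes_strength_test(samples: Sequence[float],
                         threshold: float,
                         min_count: int) -> bool:
    """Claim 4 style check: within one period, at least `min_count`
    samples exceed `threshold` in absolute amplitude."""
    loud = sum(1 for s in samples if abs(s) > threshold)
    return loud >= min_count


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude; a stand-in admission metric (assumption)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def admit_frames(frames: List[Sequence[float]],
                 min_energy: float) -> List[Sequence[float]]:
    """Claim 6 style filter: keep only frames whose metric passes."""
    return [f for f in frames if frame_energy(f) >= min_energy]


def smooth_labels(per_frame_labels: Sequence[str]) -> str:
    """Claim 7 style smoothing, sketched as a majority vote over the
    per-frame speaker decisions in one smoothing window (assumption)."""
    return Counter(per_frame_labels).most_common(1)[0][0]
```

In this sketch a caller would run `passes_strength_test` on every period, and only build frames and call `admit_frames` and `smooth_labels` once the cheaper checks have passed.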

15. A device for determining the identities of users, comprising:
- a first processing unit including logic for determining whether an audio signal contains human speech; and
  
  a second processing unit including logic for identifying, when invoked by the first processing unit, a speaker who is associated with the audio signal by comparing the audio signal with one or more speaker models,
  
  operations performed by the first processing unit being performed on a more frequent basis than operations performed by the second processing unit, and
  
  the operations performed by the first processing unit having a first power expenditure and the operations performed by the second processing unit having a second power expenditure, the first power expenditure being less than the second power expenditure.
- Dependent claims (16):
  - 16. The device of claim 15, wherein the device comprises a portable communication device.
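
The gating arrangement of claim 15 can be sketched in Python as a cheap detector that runs on every audio chunk and invokes a costly identifier only when speech is found. The class name, the callables, and the per-call energy costs are hypothetical placeholders chosen for illustration, not values from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Sequence

Chunk = Sequence[float]


@dataclass
class Cascade:
    """Low-power unit gating a high-power unit (claim 15 style)."""
    detect_speech: Callable[[Chunk], bool]    # runs on every chunk
    identify_speaker: Callable[[Chunk], str]  # runs only when invoked
    cost_first: float = 1.0    # hypothetical energy units per call
    cost_second: float = 20.0  # hypothetical; larger than cost_first
    calls_first: int = 0
    calls_second: int = 0

    def process(self, chunks: Iterable[Chunk]) -> List[Optional[str]]:
        """Return a speaker label per chunk, or None when no speech."""
        results: List[Optional[str]] = []
        for chunk in chunks:
            self.calls_first += 1
            if self.detect_speech(chunk):
                self.calls_second += 1
                results.append(self.identify_speaker(chunk))
            else:
                results.append(None)
        return results

    def energy(self) -> float:
        """Total energy under the assumed per-call costs."""
        return (self.calls_first * self.cost_first
                + self.calls_second * self.cost_second)


# Toy detector and identifier, for demonstration only.
cascade = Cascade(
    detect_speech=lambda c: max(abs(s) for s in c) > 0.3,
    identify_speaker=lambda c: "speaker-A",
)
labels = cascade.process([[0.0, 0.1], [0.5, -0.4], [0.0, 0.0]])
```

Because the second stage is reached only through the first, `calls_second` can never exceed `calls_first`, which mirrors the claim's requirement that the first unit operates more frequently.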

17. A method, implemented by computing functionality, for generating one or more speaker models, comprising:
- obtaining an audio segment in an unobtrusive manner from a first participant, the audio segment capturing a conversation between the first participant and a second participant;
  
  generating a speaker model for at least the second participant based on the audio segment; and
  
  downloading the speaker model of said at least the second participant to a user device associated with the first participant, the user device including functionality for using the speaker model to identify the second participant using a heterogeneous energy-efficient processing architecture.
- Dependent claims (18, 19, 20):
  - 18. The method of claim 17, wherein the conversation pertains to a telephone conversation, and wherein the first participant is a local participant of the telephone conversation and the second participant is a remote participant of the telephone conversation, and wherein said generating comprises:
    - generating a speaker model for the first participant based on a part of the audio segment that is attributed to the first participant; and
      
      generating the speaker model for the second participant based on a part of the audio segment that is attributed to the second participant.
  - 19. The method of claim 18, wherein said generating of the speaker model for the second participant involves employing one or more compensation techniques to address channel distortion.
  - 20. The method of claim 17, wherein the conversation pertains to a one-on-one conversation between the first participant and the second participant, and wherein said generating comprises:
    - identifying a part of the audio segment that is attributed to the first participant and a part of the audio segment that is attributed to the second participant using an audio segmentation algorithm; and
      
      generating the speaker model for the second participant based on the part of the audio segment that is attributed to the second participant.
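
One way to picture the segmentation step of claim 20 (and the variant in claim 12 that uses the first participant's existing model) is the toy Python sketch below. It assumes, purely for illustration, that each frame is already a feature vector, that a speaker model is the centroid of such vectors, and that attribution is a Euclidean distance threshold. The claims do not specify any of these choices, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def centroid(frames: List[Vector]) -> List[float]:
    """Average feature vector; a toy stand-in for a speaker model."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def split_by_known_speaker(frames: List[Vector],
                           first_model: Vector,
                           max_distance: float
                           ) -> Tuple[List[Vector], List[Vector]]:
    """Attribute frames close to `first_model` to the first participant
    and all remaining frames to the second participant."""
    first: List[Vector] = []
    second: List[Vector] = []
    for frame in frames:
        if math.dist(frame, first_model) <= max_distance:
            first.append(frame)
        else:
            second.append(frame)
    return first, second


# Toy feature vectors: two near the known speaker, two far away.
frames = [[0.0, 0.0], [0.25, 0.0], [4.0, 4.0], [6.0, 2.0]]
first_part, second_part = split_by_known_speaker(frames, [0.0, 0.0], 1.0)
second_model = centroid(second_part)
```

The second participant's model is built only from frames that were not attributed to the first participant, which is the structure the claim describes; a real system would use a proper statistical model rather than a centroid.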

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Brush, Alice Jane B.; Priyantha, Nissanka Arachchige Bodhi; Liu, Jie; Karlson, Amy K.; Lu, Hong
Granted Patent: US 8,731,936 B2
US Class Current: 704/246
CPC Class Codes:
- G10L 17/02   Preprocessing operations, e...
- G10L 17/04   Training, enrolment or mode...
- G10L 25/78   Detection of presence or ab...
