VOICE-BODY IDENTITY CORRELATION

US 20110182481A1
Filed: 01/25/2010
Published: 07/28/2011
Est. Priority Date: 01/25/2010
Status: Active Grant

Abstract

A system and method are disclosed for tracking image and audio data over time to automatically identify a person based on a correlation of their voice with their body in a multi-user game or multimedia setting.

20 Claims

1. In a multi-user application starting with an unknown set of users, a method of identifying a correlation between a user and user voice, the method comprising the steps of:
- (a) receiving a plurality of images of objects within a field of view of a video capture component taken over a plurality of time periods;
  
  (b) determining whether the images received in said step (a) include one or more users;
  
  (c) receiving audio within the range of a microphone array for a plurality of time periods;
  
  (d) determining whether the audio received in said step (c) includes one or more human voices; and
  
  (e) correlating a voice identified in said step (d) to a user of the one or more users within the field of view based on a plurality of samplings of determined positions of the user in different images and determined source locations of the voice at different times.
  - 2. The method recited in claim 1, said step (e) comprising the step of a sampling of the plurality of samplings being formed by determining a location of the one or more users from examination of an image of the plurality of images and being formed by determining a location of the voice using an acoustic source localization technique.
  - 3. The method recited in claim 1, said step (e) comprising the step of performing a first sampling of the plurality of samplings to obtain a confidence level in an association between the voice and the user, a confidence level above a predefined threshold resulting in the voice and the user being associated together in memory.
  - 4. The method recited in claim 3, said step (e) comprising the step of the confidence level going up in subsequent samplings of the plurality of samplings if the subsequent samplings decrease the number of possible users to whom the voice can belong to.
  - 5. The method recited in claim 4, further comprising the step of unambiguously correlating the voice to a user upon eliminating all other users to whom the voice could belong to in the plurality of samplings.
  - 6. The method recited in claim 5, further comprising the step of performing additional samplings in the plurality of samplings after the correlation between the voice and user has been unambiguously associated together.
  - 7. The method recited in claim 3, further comprising the step of removing the correlation if the additional samplings are unable to remove an ambiguity with respect to which user the voice belongs or if the additional samplings show the voice belongs to a second user of the one or more users.
  - 8. The method recited in claim 1, said step (e) comprising the step of performing a first sampling of the plurality of samplings to derive a scored confidence level of an association between the voice and a user, the scored confidence level obtained by examining one or more of the following factors:
    - i. how close the estimated position of the voice source is to the one or more users;
      
      ii. the number of voices which are being heard;
      
      iii. the closeness of the one or more users to an estimated source of the voice;
      
      iv. whether the source of the voice is estimated to be centered within a field of view of the image or closer to edges of the field of view.
  - 9. The method recited in claim 1, said step (b) of determining whether the images received in said step (a) include one or more users comprising the step of measuring locations of at least portions of the users skeletal joints.
  - 10. The method of claim 9, said step (e) of correlating a voice identified in said step (d) to a user based in part on a determined source locations of the voice comprising the step of determining source locations of a voice by time difference of arrivals.
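
The narrowing behaviour described in claims 3 to 5 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the bearing-in-degrees representation, the 10 degree tolerance, and the choice of confidence as one over the number of remaining candidates are all assumptions made for illustration.

```python
def users_near_voice(voice_bearing, user_bearings, tolerance_deg):
    """Return the users whose bearing is within tolerance of the voice bearing."""
    return {
        user for user, bearing in user_bearings.items()
        if abs(bearing - voice_bearing) <= tolerance_deg
    }


def correlate_over_samplings(samplings, tolerance_deg=10.0):
    """Narrow the set of possible speakers across samplings.

    Each sampling is (voice_bearing, {user: user_bearing}) in degrees.
    Returns (remaining_users, confidence). Confidence is 1/len(remaining),
    an illustrative choice that rises as candidates are eliminated.
    """
    remaining = None
    for voice_bearing, user_bearings in samplings:
        near = users_near_voice(voice_bearing, user_bearings, tolerance_deg)
        remaining = near if remaining is None else remaining & near
        if not remaining:
            # No user is consistent with every sampling: no correlation.
            return set(), 0.0
    if remaining is None:
        return set(), 0.0
    return remaining, 1.0 / len(remaining)


# Hypothetical data: users A and B stand close together in the first
# sampling, then move apart before the second one.
samplings = [
    (0.0, {"A": -5.0, "B": 5.0, "C": 40.0}),
    (20.0, {"A": -30.0, "B": 18.0, "C": 45.0}),
]
print(correlate_over_samplings(samplings[:1]))
print(correlate_over_samplings(samplings))
```

In this sketch the first sampling leaves two candidates, so the association stays ambiguous; the second sampling eliminates one of them, which corresponds to the unambiguous correlation of claim 5.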

11. In a multi-user application where correlation of a voice to a user may require more than a single sampling of the voice and user location, a method of identifying a correlation between a user and user voice, the method comprising the steps of:
- (a) receiving a plurality of images of objects within a field of view of a video capture component taken over a plurality of time periods;
  
  (b) determining whether the images received in said step (a) include one or more users;
  
  (c) receiving audio within the range of a microphone array for a plurality of time periods covering the plurality of images;
  
  (d) determining whether the audio received in said step (c) includes one or more human voices;
  
  (e) performing an initial sampling examining a location of one or more users with respect to an image capture component and a location of a voice with respect to an audio capture component, the initial sampling determining the voice is correlated to a user of the one or more users above a threshold confidence level; and
  
  (f) performing additional samplings examining locations of the one or more users with respect to the image capture component and locations of the voice with respect to the audio capture component, the additional samplings confirming the correlation of the voice with the user or the additional samplings reducing a likelihood that the voice is correlated to the user.
  - 12. The method of claim 11, wherein said step (e) further comprises the step of examining physical traits of the user to distinguish the user from other users and examining acoustic qualities of the voice to distinguish the voice from other voices.
  - 13. The method of claim 11, wherein said steps (e) and (f) unambiguously correlate the voice to the user by identifying a correlation between a location of the user and a source of the voice and by eliminating all other users as possible sources of the voice.
  - 14. The method of claim 12, further comprising the step of performing additional samplings to reaffirm the correlation of the voice to the user after the voice has been unambiguously correlated to the user.
  - 15. The method of claim 11, further comprising the step of removing the correlation between the voice and user determined in said step (e) where the additional samplings in said (f) are unable to disambiguate the correlation between the voice and the user.
  - 16. The method of claim 11, wherein said step (e) determines whether the voice is correlated to a user of the one or more users above a threshold confidence level by examining one or more of the following factors:
    - i. how close the estimated position of the voice source is to the one or more users;
      
      ii. the number of voices which are being heard;
      
      iii. the closeness of the one or more users to an estimated source of the voice;
      
      iv. whether the source of the voice is estimated to be centered within a field of view of the image or closer to edges of the field of view.
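
One way to picture how the four factors of claim 16 could combine into a single score is sketched below. The claims do not specify any formula, so everything here is an assumption: the function name, the one-dimensional positions in metres, the multiplicative combination, the `max_dist` scale, and the reading of factor (iii) as "other users near the source compete for the voice".

```python
def score_association(user, voice_x, user_xs, num_voices,
                      fov_half_width, max_dist=1.0):
    """Score in [0, 1] for 'this voice belongs to this user'.

    Positions are horizontal offsets in metres from the centre of the
    field of view. All weightings are illustrative assumptions.
    """
    # (i) Closeness of the estimated voice source to this user.
    proximity = max(0.0, 1.0 - abs(voice_x - user_xs[user]) / max_dist)
    # (ii) More simultaneous voices means more ambiguity.
    voice_factor = 1.0 / max(1, num_voices)
    # (iii) Other users close to the source compete for the voice.
    rivals = [abs(voice_x - x) for u, x in user_xs.items() if u != user]
    separation = min(1.0, min(rivals) / max_dist) if rivals else 1.0
    # (iv) Sources near the edge of the field of view are trusted less.
    centering = 1.0 - 0.5 * min(1.0, abs(voice_x) / fov_half_width)
    return proximity * voice_factor * separation * centering


# Hypothetical scene: user A near the centre, user B two metres away.
users = {"A": 0.0, "B": 2.0}
print(score_association("A", 0.1, users, num_voices=1, fov_half_width=2.0))
print(score_association("B", 0.1, users, num_voices=1, fov_half_width=2.0))
```

A score like this would then be compared against the threshold confidence level of step (e); the threshold value itself is not given in the claims.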

17. A system for correlating a voice to user in a multi-user application, the system comprising:
- an image camera component capable of providing a depth image of one or more users in a field of view of the image camera component;
  
  a microphone array capable of receiving audio within range of the microphone array, the microphone array capable of localizing a source of a voice to within a first tolerance; and
  
  a computing environment in communication with both the image capture component and microphone array, the computing environment capable of distinguishing between different users in the field of view to a second tolerance, the first and second tolerances at times preventing correlation of the voice to a user of the one or more users after an initial sampling of data from the image camera and data from the microphone array, the computing environment further performing additional samplings of data from the image camera and data from the microphone array, the additional samplings allowing the correlation of the voice with the user or the additional samplings reducing a likelihood that the voice is correlated to the user.
  - 18. The system of claim 17, wherein the computing environment executes a gaming application involving the one or more users while performing the initial and additional samplings.
  - 19. The system of claim 17, wherein the computing environment distinguishes between different users in the field of view by detecting locations of joints of the one or more users.
  - 20. The system of claim 19, wherein the microphone array uses two microphones to localize a source of the voice by time difference of arrivals of the voice to the two microphones.
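
Claim 20 refers to localizing a voice by the time difference of arrivals at two microphones. A minimal sketch of the standard far-field relation, bearing = asin(c · Δt / d), is shown below. The function name, the fixed speed of sound, and the far-field (plane-wave) assumption are illustrative and are not taken from the patent.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, roughly air at 20 degrees C


def bearing_from_tdoa(delay_s, mic_spacing_m, speed=SPEED_OF_SOUND):
    """Estimate the source bearing in degrees from a two-microphone delay.

    delay_s is the arrival-time difference between the two microphones.
    0 degrees is straight ahead (broadside); the sign follows the sign of
    the delay. Assumes the source is far enough away for a plane wave.
    """
    ratio = speed * delay_s / mic_spacing_m
    # Measurement noise can push the ratio slightly outside [-1, 1].
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


# Hypothetical array with microphones 0.2 m apart.
print(bearing_from_tdoa(0.0, 0.2))
print(bearing_from_tdoa(0.1 / SPEED_OF_SOUND, 0.2))
```

With only two microphones the delay fixes a bearing but not a distance, which is one reason the localization tolerance in claim 17 can leave several users as candidates after a single sampling.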

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Dernis, Mitchell; Klein, Christian; Leyvand, Tommer; Li, Jinyu

Granted Patent: US 8,265,341 B2
US Class Current: 382/116
CPC Class Codes

A63F 13/213   comprising photodetecting m...

A63F 13/215   comprising means for detect...

A63F 13/79   involving player-related da...

G06F 16/7834   using audio features

G06V 40/70   Multimodal biometrics, e.g....

G10L 17/10   Multimodal systems, i.e. ba...

G10L 2021/02166   Microphone arrays; Beamforming
