Voice tracking camera with speaker identification
First Claim
1. An automated videoconferencing method for a near-end endpoint located in a near-end environment at a near-end site to conduct a videoconference with one or more far-end endpoints located in one or more far-end environments at one or more far-end sites, the method comprising:
detecting one or more first audio indicative of speech during the videoconference with the near-end endpoint in the near-end environment at the near-end site;
determining one or more first speech frequency characteristics of the one or more first audio, the one or more first speech frequency characteristics characterizing frequency information indicative of the speech detected in the one or more first audio;
determining one or more first source locations of the one or more first audio with the near-end endpoint in the near-end environment at the near-end site;
directing at least one camera of the near-end endpoint at at least one of the one or more first source locations in the near-end environment;
storing the one or more first speech frequency characteristics with information of at least one camera directed at the at least one of the one or more first source locations;
detecting second audio indicative of speech with the near-end endpoint in the near-end environment at the near-end site;
determining a second speech frequency characteristic of the second audio, the second speech frequency characteristic characterizing frequency information indicative of the speech detected in the second audio; and
determining that the second speech frequency characteristic does not match the stored one or more first speech frequency characteristics;
determining a second source location of the second audio with the near-end endpoint in the near-end environment at the near-end site;
directing the at least one camera of the near-end endpoint at the second source location in the near-end environment; and
storing the second speech frequency characteristic with information of the at least one camera directed at the second source location.
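The claimed loop can be sketched as a small speaker tracker: characterize detected speech by its frequency content, match it against stored characteristics, and remember the camera information used for each speaker. This is only an illustrative sketch; the band-energy feature, the cosine-similarity comparison, and the `MATCH_THRESHOLD` value are assumptions, not the patent's recited method of computing or matching speech frequency characteristics.

```python
# Sketch of the claim-1 loop. All names (SpeakerTracker, MATCH_THRESHOLD,
# band-energy features) are illustrative assumptions.
import numpy as np

MATCH_THRESHOLD = 0.95  # cosine-similarity cutoff (assumed value)

def speech_frequency_characteristic(samples: np.ndarray) -> np.ndarray:
    """Summarize speech as normalized energy in eight frequency bands."""
    spectrum = np.abs(np.fft.rfft(samples))
    edges = np.linspace(0, len(spectrum), 9, dtype=int)  # 8 band boundaries
    bands = np.array([spectrum[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
    norm = np.linalg.norm(bands)
    return bands / norm if norm > 0 else bands

class SpeakerTracker:
    def __init__(self):
        self.profiles = []  # (speech frequency characteristic, camera info) pairs

    def handle_speech(self, samples, source_location):
        feat = speech_frequency_characteristic(samples)
        # Compare against stored characteristics (claimed "does not match" test).
        for stored_feat, cam_pos in self.profiles:
            if float(np.dot(feat, stored_feat)) >= MATCH_THRESHOLD:
                return cam_pos  # known speaker: reuse stored camera information
        # No match: direct the camera at the new source location and store
        # the characteristic with the camera information.
        cam_pos = source_location
        self.profiles.append((feat, cam_pos))
        return cam_pos
```

In this sketch, re-hearing a stored speaker returns the previously stored camera position instead of re-steering from scratch, which is the practical payoff of pairing speech characteristics with camera information.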
10 Assignments
0 Petitions
Abstract
A videoconferencing apparatus automatically tracks speakers in a room and dynamically switches between a controlled, people-view camera and a fixed, room-view camera. When no one is speaking, the apparatus shows the room view to the far end. When there is a dominant speaker in the room, the apparatus directs the people-view camera at the dominant speaker and switches from the room-view camera to the people-view camera. When there is a new speaker in the room, the apparatus switches to the room-view camera first, directs the people-view camera at the new speaker, and then switches to the people-view camera directed at the new speaker. When there are two near-end speakers engaged in a conversation, the apparatus tracks and zooms in the people-view camera so that both speakers are in view.
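The abstract's switching behavior can be modeled as a small state machine: cut to the room view whenever the people-view camera is being redirected, and cut to the people view only once it is framed on the current speaker. The class and method names below are illustrative assumptions, not the patent's implementation.

```python
# Sketch of the abstract's camera-switching rules as a state machine.
# Names and the two-call framing model are assumptions for illustration.
class CameraSwitcher:
    ROOM, PEOPLE = "room-view", "people-view"

    def __init__(self):
        self.output = self.ROOM       # no one speaking: show the room view
        self.framed_speaker = None    # speaker the people-view camera points at

    def on_silence(self):
        """No dominant speaker: fall back to the fixed room-view camera."""
        self.output = self.ROOM
        return self.output

    def on_dominant_speaker(self, speaker_id):
        """Called each time a dominant speaker is identified."""
        if speaker_id != self.framed_speaker:
            # New speaker: switch to the room view first while the
            # people-view camera is directed at the new speaker.
            self.output = self.ROOM
            self.framed_speaker = speaker_id
        else:
            # People-view camera already directed at this speaker: cut over.
            self.output = self.PEOPLE
        return self.output
```

The two-step cut on a new speaker (room view first, then people view) avoids sending the far end footage of the people-view camera panning mid-move.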
46 Citations
22 Claims
1. An automated videoconferencing method for a near-end endpoint located in a near-end environment at a near-end site to conduct a videoconference with one or more far-end endpoints located in one or more far-end environments at one or more far-end sites, the method comprising:
detecting one or more first audio indicative of speech during the videoconference with the near-end endpoint in the near-end environment at the near-end site;
determining one or more first speech frequency characteristics of the one or more first audio, the one or more first speech frequency characteristics characterizing frequency information indicative of the speech detected in the one or more first audio;
determining one or more first source locations of the one or more first audio with the near-end endpoint in the near-end environment at the near-end site;
directing at least one camera of the near-end endpoint at at least one of the one or more first source locations in the near-end environment;
storing the one or more first speech frequency characteristics with information of at least one camera directed at the at least one of the one or more first source locations;
detecting second audio indicative of speech with the near-end endpoint in the near-end environment at the near-end site;
determining a second speech frequency characteristic of the second audio, the second speech frequency characteristic characterizing frequency information indicative of the speech detected in the second audio;
determining that the second speech frequency characteristic does not match the stored one or more first speech frequency characteristics;
determining a second source location of the second audio with the near-end endpoint in the near-end environment at the near-end site;
directing the at least one camera of the near-end endpoint at the second source location in the near-end environment; and
storing the second speech frequency characteristic with information of the at least one camera directed at the second source location.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
21. A non-transitory program storage device having program instructions stored thereon for causing a programmable control device to perform an automated videoconferencing method for a near-end endpoint located in a near-end environment at a near-end site conducting a videoconference with one or more far-end endpoints located in one or more far-end environments at one or more far-end sites, the method comprising:
detecting one or more first audio indicative of speech during the videoconference with the near-end endpoint in the near-end environment at the near-end site;
determining one or more first speech frequency characteristics of the one or more first audio, the one or more first speech frequency characteristics characterizing frequency information indicative of the speech detected in the one or more first audio;
determining one or more first source locations of the one or more first audio with the near-end endpoint in the near-end environment at the near-end site;
directing at least one camera of the near-end endpoint at at least one of the one or more first source locations in the near-end environment;
storing the one or more first speech frequency characteristics with information of the at least one camera directed at the at least one of the one or more first source locations;
detecting second audio indicative of speech with the near-end endpoint in the near-end environment at the near-end site;
determining a second speech frequency characteristic of the second audio, the second speech frequency characteristic characterizing frequency information indicative of the speech detected in the second audio;
determining that the second speech frequency characteristic does not match the stored one or more first speech frequency characteristics;
determining a second source location of the second audio with the near-end endpoint in the near-end environment at the near-end site;
directing the at least one camera of the near-end endpoint at the second source location in the near-end environment; and
storing the second speech frequency characteristic with information of the at least one camera directed at the second source location.
22. A videoconferencing apparatus located in a near-end environment at a near-end site and operatively coupling in a videoconference with one or more far-end endpoints located in one or more far-end environments at one or more far-end sites, the apparatus comprising:
at least one camera being controllable to pan, tilt, and zoom in the near-end environment at the near-end site;
one or more microphones for capturing near-end audio in the near-end environment at the near-end site;
memory storing speech frequency characteristics for near-end participants located in the near-end environment at the near-end site and storing source locations in the near-end environment of the near-end participants associated with the stored speech frequency characteristics; and
a processing unit operatively coupled to the at least one camera, the one or more microphones, and the memory, the processing unit being operable to:
detect one or more first audio indicative of speech during the videoconference with the near-end endpoint in the near-end environment at the near-end site;
determine one or more first speech frequency characteristics of the one or more first audio, the one or more first speech frequency characteristics characterizing frequency information indicative of the speech detected in the one or more first audio;
determine one or more first source locations of the one or more first audio with the near-end endpoint in the near-end environment at the near-end site;
direct at least one camera of the near-end endpoint at at least one of the one or more first source locations in the near-end environment;
store the one or more first speech frequency characteristics with information of the at least one camera directed at the at least one of the one or more first source locations;
detect second audio indicative of speech with the near-end endpoint in the near-end environment at the near-end site;
determine a second speech frequency characteristic of the second audio, the second speech frequency characteristic characterizing frequency information indicative of the speech detected in the second audio;
determine that the second speech frequency characteristic does not match the stored one or more first speech frequency characteristics;
determine a second source location of the second audio with the near-end endpoint in the near-end environment at the near-end site;
direct the at least one camera of the near-end endpoint at the second source location in the near-end environment; and
store the second speech frequency characteristic with information of the at least one camera directed at the second source location.
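Claim 22 pairs one or more microphones with a steerable camera for determining source locations. One common way to estimate a talker's bearing from two microphones is time-difference-of-arrival via cross-correlation; the patent does not recite this particular method, so the sketch below is an assumption for illustration only.

```python
# Illustrative (assumed) source-bearing estimate from two microphone
# channels via cross-correlation TDOA; not the patent's recited method.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def bearing_from_tdoa(left: np.ndarray, right: np.ndarray,
                      mic_spacing: float, rate: int) -> float:
    """Return source bearing in degrees (0 = broadside to the mic pair)."""
    # Cross-correlate to find the signed lag (in samples) between channels.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    delay = lag / rate  # signed inter-channel delay in seconds
    # Far-field model: sin(bearing) = delay * c / spacing, clipped to [-1, 1].
    x = np.clip(delay * SPEED_OF_SOUND / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(x)))
```

An endpoint could feed such a bearing to the pan axis of the people-view camera, then store it alongside the matching speech frequency characteristic as in the claims.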
Specification