Using audio characteristics to identify speakers and media items
Abstract
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing speaker identification. In some implementations, data identifying a media item including speech of a speaker is received. Based on the received data, one or more other media items that include speech of the speaker are identified. One or more search results are generated that each reference a respective media item of the one or more other media items that include speech of the speaker. The one or more search results are provided for display.
19 Claims
1. A method performed by one or more computers, the method comprising:
receiving, by the one or more computers, a request from a client device for media content, the request including at least a portion of a first media item or a URL corresponding to the first media item, the first media item including speech of a person;
based on the data indicating the first media item, selecting, by the one or more computers, one or more other media items based on one or more representations of acoustic characteristics of the one or more other media items, wherein the one or more representations of acoustic characteristics of the one or more other media items comprise, for each of the one or more other media items, a speaker representation that includes (i) an i-vector or d-vector generated from the other media item, or (ii) a hash of an i-vector or d-vector generated from the other media item;
wherein each of the one or more other media items is selected based on a comparison of (i) an i-vector, d-vector or hash determined from speech in the first media item with (ii) the speaker representation for the other media item, wherein:
each of the selected one or more other media items is different from the first media item;
each of the selected one or more other media items includes speech of the same person whose speech is included in the first media item; and
each of the selected one or more other media items is determined, based on the acoustic characteristics of the media item, to include speech demonstrating speaker characteristics that have at least a threshold level of similarity with speaker characteristics determined from speech in the first media item;
generating, by the one or more computers, data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item; and
providing, by the one or more computers and to the client device, a response to the request that includes the data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
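The comparison and threshold steps recited in claim 1 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the claim does not name a similarity measure, so cosine similarity is assumed here; the function names, the item identifiers, the threshold value of 0.8, and the three-dimensional toy vectors (standing in for real i-vectors or d-vectors) are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_similar_items(query_id, query_vector, index, threshold=0.8):
    """Return IDs of other items whose speaker vector meets the threshold.

    `index` maps a media item ID to its stored speaker vector. The query
    item itself is skipped, so every result differs from the first item.
    """
    scored = []
    for item_id, vector in index.items():
        if item_id == query_id:
            continue
        score = cosine_similarity(query_vector, vector)
        if score >= threshold:
            scored.append((item_id, score))
    # Most similar items first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in scored]


# Toy data: hypothetical 3-dimensional stand-ins for speaker vectors.
index = {
    "video_a": [0.90, 0.10, 0.30],
    "video_b": [0.88, 0.12, 0.31],   # close to video_a
    "video_c": [-0.20, 0.95, 0.10],  # a different direction
}
results = select_similar_items("video_a", index["video_a"], index)
```

With these toy values only video_b clears the threshold, so the response data would reference that single item. A production system would likely use an approximate nearest-neighbor index rather than a linear scan; that choice is outside what the claim specifies.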
9. A system comprising:
one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
receiving, by the one or more computers, a request from a client device for media content, the request including at least a portion of a first media item or a URL corresponding to the first media item, the first media item including speech of a person;
based on the data indicating the first media item, selecting, by the one or more computers, one or more other media items based on one or more representations of acoustic characteristics of the one or more other media items, wherein the one or more representations of acoustic characteristics of the one or more other media items comprise, for each of the one or more other media items, a speaker representation that includes (i) an i-vector or d-vector generated from the other media item, or (ii) a hash of an i-vector or d-vector generated from the other media item;
wherein each of the one or more other media items is selected based on a comparison of (i) an i-vector, d-vector or hash determined from speech in the first media item with (ii) the speaker representation for the other media item, wherein:
each of the selected one or more other media items is different from the first media item;
each of the selected one or more other media items includes speech of the same person whose speech is included in the first media item; and
each of the selected one or more other media items is determined, based on acoustic characteristics of the media item, to include speech demonstrating speaker characteristics that have at least a threshold level of similarity with speaker characteristics determined from speech in the first media item;
generating, by the one or more computers, data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item; and
providing, by the one or more computers and to the client device, a response to the request that includes the data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item.
Dependent claims: 10, 11, 12, 13, 14, 15
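The claims also allow the speaker representation to be a hash of an i-vector or d-vector rather than the vector itself. The sketch below shows one way such a comparison could work. It is an assumption, not the patented method: the claims do not name a hashing scheme, so sign-based random projection (a form of locality-sensitive hashing) is swapped in here, with Hamming distance as the comparison. All function names, the bit count, the seed, and the toy vectors are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw `num_bits` random hyperplanes of dimension `dim` (seeded)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def select_by_hash(query_id, query_hash, hash_index, max_distance):
    """Return IDs of other items whose hash is within `max_distance` bits."""
    return [
        item_id
        for item_id, item_hash in hash_index.items()
        if item_id != query_id
        and hamming_distance(query_hash, item_hash) <= max_distance
    ]


planes = make_hyperplanes(num_bits=16, dim=4)
voice = [0.4, -1.2, 0.7, 0.3]  # hypothetical speaker vector
hash_index = {
    "clip_a": hash_vector(voice, planes),
    "clip_b": hash_vector([2 * x for x in voice], planes),  # same direction
    "clip_c": hash_vector([-x for x in voice], planes),     # opposite direction
}
matches = select_by_hash("clip_a", hash_index["clip_a"], hash_index, max_distance=2)
```

Vectors pointing in the same direction fall on the same side of every hyperplane and therefore share a hash, while the opposite vector differs in every bit. Here `max_distance` plays the role of the threshold level of similarity: a smaller value demands a closer match.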
16. One or more non-transitory computer-readable media storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
receiving, by the one or more computers, a request from a client device for media content, the request including at least a portion of a first media item or a URL corresponding to the first media item, the first media item including speech of a person;
based on the data indicating the first media item, selecting, by the one or more computers, one or more other media items based on one or more representations of acoustic characteristics of the one or more other media items, wherein the one or more representations of acoustic characteristics of the one or more other media items comprise, for each of the one or more other media items, a speaker representation that includes (i) an i-vector or d-vector generated from the other media item, or (ii) a hash of an i-vector or d-vector generated from the other media item;
wherein each of the one or more other media items is selected based on a comparison of (i) an i-vector, d-vector or hash determined from speech in the first media item with (ii) the speaker representation for the other media item, wherein:
each of the selected one or more other media items is different from the first media item;
each of the selected one or more other media items includes speech of the same person whose speech is included in the first media item; and
each of the selected one or more other media items is determined, based on the acoustic characteristics of the media item, to include speech demonstrating speaker characteristics that have at least a threshold level of similarity with speaker characteristics determined from speech in the first media item;
generating, by the one or more computers, data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item; and
providing, by the one or more computers and to the client device, a response to the request that includes the data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item.
Dependent claims: 17, 18, 19
Specification