Using audio characteristics to identify speakers and media items

US 10,140,991 B2
Filed: 06/16/2017
Issued: 11/27/2018
Est. Priority Date: 11/04/2013
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing speaker identification. In some implementations, data identifying a media item including speech of a speaker is received. Based on the received data, one or more other media items that include speech of the speaker are identified. One or more search results are generated that each reference a respective media item of the one or more other media items that include speech of the speaker. The one or more search results are provided for display.
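
The lookup the abstract describes can be sketched as a threshold search over stored speaker vectors. This is a minimal illustration, not the patented implementation: cosine similarity, the 0.8 threshold, the `find_items_with_same_speaker` helper, the in-memory `index` layout, and the example.com URLs are all assumptions made for the example. The claims below require only a comparison of i-vectors, d-vectors, or hashes against "a threshold level of similarity".

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_items_with_same_speaker(query_id, query_vector, index, threshold=0.8):
    """Return search results for indexed media items whose speaker vector
    is at least `threshold`-similar to the query vector.

    `index` maps item id -> (url, speaker_vector). The query item itself is
    skipped so every result is a different media item.
    """
    results = []
    for item_id, (url, vector) in index.items():
        if item_id == query_id:
            continue
        score = cosine_similarity(query_vector, vector)
        if score >= threshold:
            results.append({"id": item_id, "url": url, "score": score})
    # Most similar items first.
    results.sort(key=lambda r: r["score"], reverse=True)
    return results


# Toy 3-dimensional vectors (hypothetical); real i-vectors and d-vectors
# have many more dimensions.
index = {
    "a": ("https://example.com/a", [0.9, 0.1, 0.0]),
    "b": ("https://example.com/b", [0.0, 1.0, 0.0]),
    "c": ("https://example.com/c", [0.8, 0.2, 0.1]),
}
results = find_items_with_same_speaker("q", [1.0, 0.0, 0.0], index)
```

With these toy vectors, items "a" and "c" clear the threshold and item "b" does not. A linear scan is used only for clarity; a large index would presumably need an approximate nearest-neighbor structure.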

19 Claims

1. A method performed by one or more computers, the method comprising:
- receiving, by the one or more computers, a request from a client device for media content, the request including at least a portion of a first media item or a URL corresponding to the first media item, the first media item including speech of a person;
  
  based on the data indicating the first media item, selecting, by the one or more computers, one or more other media items based on one or more representations of acoustic characteristics of the one or more other media items, wherein the one or more representations of acoustic characteristics of the one or more other media items comprise, for each of the one or more other media items, a speaker representation that includes (i) an i-vector or d-vector generated from the other media item, or (ii) a hash of an i-vector or d-vector generated from the other media item;
  
  wherein each of the one or more other media items is selected based on a comparison of (i) an i-vector, d-vector or hash determined from speech in the first media item with (ii) the speaker representation for the other media item, wherein:
  
  each of the selected one or more other media items is different from the first media item;
  
  each of the selected one or more other media items includes speech of the same person whose speech is included in the first media item; and
  
  each of the selected one or more other media items is determined, based on the acoustic characteristics of the media item, to include speech demonstrating speaker characteristics that have at least a threshold level of similarity with speaker characteristics determined from speech in the first media item;
  
  generating, by the one or more computers, data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item; and
  
  providing, by the one or more computers and to the client device, a response to the request that includes the data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item.
  - 2. The method of claim 1, wherein receiving the request comprises receiving a request that includes a URL that corresponds to (i) a video that includes speech of the person, or (ii) an audio recording that includes speech of the person.
  - 3. The method of claim 1, wherein receiving the request comprises receiving the first media item from the client device over a communication network, the received first media item comprising (i) video data that includes speech of the person, or (ii) audio data that includes speech of the person.
  - 4. The method of claim 1, further comprising providing, for display with one or more search results indicating the one or more other media items, a name of the person whose speech is included in the first media item, wherein the name of the person is determined based on determining that a representation of speaker characteristics for the first media item has at least a threshold level of similarity with representations of speaker characteristics for the one or more other media items.
  - 5. The method of claim 1, wherein the first media item includes speech of multiple people;
    - wherein generating the data indicating the selected one or more other media items comprises generating data indicating one or more media items that each include speech of each of the multiple people; and
      
      wherein providing the response comprises providing a response that includes the data indicating the one or more media items that each include speech of each of the multiple people.
  - 6. The method of claim 1, wherein generating the data indicating the selected one or more other media items comprises generating one or more search results that each include a link to a media item that is available on the Internet and that includes speech of the person.
  - 7. The method of claim 1, wherein selecting the one or more other media items comprises selecting the one or more other media items based on utterance characteristics, for utterances in the other media items and the first media item, that are independent of the specific words and sounds in the utterances.
  - 8. The method of claim 7, wherein selecting the one or more other media items comprises selecting the one or more other media items based on characteristics of utterances in the one or more other media items representing the speaker's speaking style, the speaker's gender, the speaker's age, the speaker's language, or the speaker's accent.
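
Claim 1 allows the speaker representation to be a hash of an i-vector or d-vector rather than the vector itself. The claim does not name a hashing scheme, so the sketch below assumes one common choice, random-hyperplane locality-sensitive hashing, in which similar vectors tend to receive hashes that differ in few bits. The function names, the 16-bit width, the fixed seed, and the Hamming-distance cutoff are all hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw `num_bits` random hyperplane normals of dimension `dim`.

    A fixed seed keeps hashes comparable across items (assumed design).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_speaker_vector(vector, hyperplanes):
    """Emit one bit per hyperplane: 1 if the vector is on its positive side."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits


def hamming_distance(h1, h2):
    """Number of bit positions in which two hashes differ."""
    return bin(h1 ^ h2).count("1")


def hashes_match(h1, h2, max_distance):
    """Treat two items as the same speaker if their hashes are close enough."""
    return hamming_distance(h1, h2) <= max_distance


planes = make_hyperplanes(num_bits=16, dim=4)
query = [0.2, -0.5, 0.7, 0.1]           # toy speaker vector (hypothetical)
same_direction = [2.0 * x for x in query]
opposite = [-x for x in query]
h_query = hash_speaker_vector(query, planes)
```

Because each bit depends only on which side of a hyperplane the vector falls, a rescaled copy of `query` hashes identically and its negation flips every bit. Comparing short hashes is cheaper than comparing full vectors, at the cost of some accuracy.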

9. A system comprising:
- one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising:
  
  receiving, by the one or more computers, a request from a client device for media content, the request including at least a portion of a first media item or a URL corresponding to the first media item, the first media item including speech of a person;
  
  based on the data indicating the first media item, selecting, by the one or more computers, one or more other media items based on one or more representations of acoustic characteristics of the one or more other media items, wherein the one or more representations of acoustic characteristics of the one or more other media items comprise, for each of the one or more other media items, a speaker representation that includes (i) an i-vector or d-vector generated from the other media item, or (ii) a hash of an i-vector or d-vector generated from the other media item;
  
  wherein each of the one or more other media items is selected based on a comparison of (i) an i-vector, d-vector or hash determined from speech in the first media item with (ii) the speaker representation for the other media item, wherein:
  
  each of the selected one or more other media items is different from the first media item;
  
  each of the selected one or more other media items includes speech of the same person whose speech is included in the first media item; and
  
  each of the selected one or more other media items is determined, based on acoustic characteristics of the media item, to include speech demonstrating speaker characteristics that have at least a threshold level of similarity with speaker characteristics determined from speech in the first media item;
  
  generating, by the one or more computers, data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item; and
  
  providing, by the one or more computers and to the client device, a response to the request that includes the data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item.
  - 10. The system of claim 9, wherein receiving the request comprises receiving a request that includes a URL that corresponds to (i) a video that includes speech of the person, or (ii) an audio recording that includes speech of the person.
  - 11. The system of claim 9, wherein receiving the request comprises receiving the first media item from the client device over a communication network, the received first media item comprising (i) video data that includes speech of the person, or (ii) audio data that includes speech of the person.
  - 12. The system of claim 9, wherein the operations further comprise providing, for display with one or more search results indicating the one or more other media items, a name of the person whose speech is included in the first media item.
  - 13. The system of claim 12, wherein the operations further comprise determining the name of the person based on comparison of speech characteristics determined from speech in the first media item with speech characteristics determined from speech in the one or more other media items that include speech of the person.
  - 14. The system of claim 9, wherein providing the response to the request further comprises:
    - providing, by the one or more computers and to the client device, data indicating an identity of the person whose speech is included in the first media item.
  - 15. The system of claim 9, wherein generating the data indicating the selected one or more other media items comprises generating one or more search results that each include a link to a media item that is available on the Internet and that includes speech of the person.
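
Claims 12 and 13 describe determining the speaker's name by comparing speech characteristics of the first media item with those of other media items. One way this could work, sketched below, is a majority vote over matching items that already carry a name label. The labeled-item list, the voting rule, the 0.8 threshold, and the names "Alice" and "Bob" are assumptions for illustration; the claims do not specify how names are stored or chosen.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def infer_speaker_name(query_vector, labeled_items, threshold=0.8):
    """Return the most common name among labeled items whose speaker vector
    is at least `threshold`-similar to the query, or None if none match.

    `labeled_items` is a list of (name, speaker_vector) pairs.
    """
    votes = Counter(
        name
        for name, vector in labeled_items
        if cosine_similarity(query_vector, vector) >= threshold
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical labeled items with toy 2-dimensional speaker vectors.
labeled_items = [
    ("Alice", [1.0, 0.0]),
    ("Alice", [0.9, 0.1]),
    ("Bob", [0.0, 1.0]),
]
name = infer_speaker_name([1.0, 0.0], labeled_items)
```

Here both "Alice" items match the query and the "Bob" item does not, so the vote returns "Alice". Returning `None` when nothing clears the threshold lets a caller omit the name instead of guessing.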

16. One or more non-transitory computer-readable media storing software comprising instructions executable by one or more computers which, upon such execution, cause the one or more computers to perform operations comprising:
- receiving, by the one or more computers, a request from a client device for media content, the request including at least a portion of a first media item or a URL corresponding to the first media item, the first media item including speech of a person;
  
  based on the data indicating the first media item, selecting, by the one or more computers, one or more other media items based on one or more representations of acoustic characteristics of the one or more other media items, wherein the one or more representations of acoustic characteristics of the one or more other media items comprise, for each of the one or more other media items, a speaker representation that includes (i) an i-vector or d-vector generated from the other media item, or (ii) a hash of an i-vector or d-vector generated from the other media item;
  
  wherein each of the one or more other media items is selected based on a comparison of (i) an i-vector, d-vector or hash determined from speech in the first media item with (ii) the speaker representation for the other media item, wherein:
  
  each of the selected one or more other media items is different from the first media item;
  
  each of the selected one or more other media items includes speech of the same person whose speech is included in the first media item; and
  
  each of the selected one or more other media items is determined, based on the acoustic characteristics of the media item, to include speech demonstrating speaker characteristics that have at least a threshold level of similarity with speaker characteristics determined from speech in the first media item;
  
  generating, by the one or more computers, data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item; and
  
  providing, by the one or more computers and to the client device, a response to the request that includes the data indicating the selected one or more other media items that are each different from the first media item and that each include speech of the same person whose speech is included in the first media item.
  - 17. The one or more non-transitory computer-readable media of claim 16, wherein receiving the request comprises receiving a request that includes a URL that corresponds to (i) a video that includes speech of the person, or (ii) an audio recording that includes speech of the person.
  - 18. The one or more non-transitory computer-readable media of claim 16, wherein receiving the request comprises receiving the first media item from the client device over a communication network, the received first media item comprising (i) video data that includes speech of the person, or (ii) audio data that includes speech of the person.
  - 19. The method of claim 16, wherein the first media item is a video that includes speech of the person, and wherein selecting the one or more other media items comprises selecting one or more other videos that include speech of the person.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google LLC (Alphabet Inc.)
Inventors: Sharifi, Matthew; Lopez Moreno, Ignacio; Schmidt, Ludwig
Primary Examiner(s): Roberts, Shaun
Application Number: US15/624,760
Publication Number: US 20170287487A1
Time in Patent Office: 529 Days
Field of Search: 704246, 704270, 725 53

CPC Class Codes

G10L 17/00   Speaker identification or v...
G10L 17/02   Preprocessing operations, e...
G10L 17/08   Use of distortion metrics o...
G10L 17/18   Artificial neural networks;...
G10L 25/51   for comparison or discrimin...
