Systems and methods for manipulating electronic content based on speech recognition

US 10,032,465 B2
Filed: 03/01/2016
Issued: 07/24/2018
Est. Priority Date: 06/10/2010
Status: Active Grant

Abstract

Systems and methods are disclosed for displaying electronic multimedia content to a user. One computer-implemented method for manipulating electronic multimedia content includes generating, using a processor, a speech model and at least one speaker model of an individual speaker. The method further includes receiving electronic media content over a network; extracting an audio track from the electronic media content; and detecting speech segments within the electronic media content based on the speech model. The method further includes detecting a speaker segment within the electronic media content and calculating a probability of the detected speaker segment involving the individual speaker based on the at least one speaker model.
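
The speech-segment detection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that per-frame log-likelihood scores under a speech model and a non-speech model are already available; the function name `detect_speech_segments`, the `min_frames` parameter, and the sample scores are hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def detect_speech_segments(
    speech_loglik: List[float],
    nonspeech_loglik: List[float],
    min_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges where the speech model outscores
    the non-speech model for at least `min_frames` consecutive frames."""
    if len(speech_loglik) != len(nonspeech_loglik):
        raise ValueError("score lists must have the same length")

    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the run currently being tracked
    for i, (s, n) in enumerate(zip(speech_loglik, nonspeech_loglik)):
        is_speech = s > n
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            # The run ended at frame i; keep it only if it is long enough.
            if i - start >= min_frames:
                segments.append((start, i))
            start = None

    # Close a run that extends to the end of the track.
    if start is not None and len(speech_loglik) - start >= min_frames:
        segments.append((start, len(speech_loglik)))
    return segments


# Hypothetical per-frame scores (assumed inputs, not data from the patent).
speech = [-5.0, -1.0, -1.2, -0.9, -1.1, -6.0, -1.0, -5.5]
nonspeech = [-1.0, -4.0, -4.2, -3.9, -4.1, -1.0, -4.0, -1.0]
print(detect_speech_segments(speech, nonspeech))  # [(1, 5)]
```

The minimum-length filter is a design choice of this sketch: it discards isolated frames, such as frame 6 in the sample data, that would otherwise produce very short segments.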

20 Claims

1. A computer-implemented method comprising the following operations performed by at least one processor:
- extracting an audio track from an electronic media content;
  
  detecting, based on a speech model, a speaker segment within the extracted audio track;
  
  determining, by the processor, a first probability of the detected speaker segment being associated with an individual speaker by using both a speaker speech model and a non-speaker speech model, wherein the speaker speech model represents an individual speaker and the non-speaker speech model represents common characteristics from one or more speakers;
  
  determining a first ranking value of the electronic media content relative to other electronic media content based on the first probability of the detected speaker segment and probabilities for detected speaker segments within the other electronic media content;
  
  receiving a search query from a user;
  
  determining a second ranking value of the electronic media content based on relevancy between the query and the individual speaker; and
  
  determining a final ranking value of the electronic media content based on the first ranking value and the second ranking value.
- - 2. The computer-implemented method of claim 1, further comprising:
    - determining whether the final ranking value of the electronic media content equals to or exceeds a threshold; and
      
      presenting the electronic media content to the user if the final ranking value of the electronic media content equals to or exceeds the threshold.
  - 3. The computer-implemented method of claim 1, wherein detecting a speaker segment within the extracted audio track is based on a speech model and a non-speech model.
  - 4. The computer-implemented method of claim 1, further comprising:
    - presenting the electronic media content to the user based on the final ranking.
  - 5. The computer-implemented method of claim 1, further comprising:
    - determining, based on a detected face, a second probability of the detected speaker segment being associated with an individual speaker; and
      
      adjusting the first ranking value of the electronic media content based on the second probability of the detected speaker segment.
  - 6. The computer-implemented method of claim 1, further comprising:
    - generating at least one speaker speech model of an individual speaker, and a non-speaker speech model that includes common characteristics from one or more speakers.
  - 7. The computer-implemented method of claim 1, further comprising:
    - generating a plurality of speaker speech models for a subset of people, each speaker speech model corresponding to one person in the subset of people; and
      
      calculating a probability of the speaker segment involving one of the people in the subset of people, based on the plurality of speaker speech models.
  - 8. The computer-implemented method of claim 1, further comprising:
    - detecting duplicate videos associated with the electronic media content.
  - 9. The computer-implemented method of claim 1, further comprising:
    - extracting preview clips from the electronic media content; and
      
      presenting the extracted preview clips to the user.
  - 10. The computer-implemented method of claim 1, further comprising:
    - detecting boundaries among consecutive visual scenes within the electronic media content; and
      
      extracting, based on the detected boundaries and detected speaker segment, a preview clip from the electronic media content.
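
The ranking operations of claims 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: the claims do not specify how any ranking value is computed, so the percentile-style first ranking, the token-overlap relevancy, the weighted average, and all names (`MediaItem`, `first_ranking_values`, `second_ranking_value`, `final_ranking_values`, `select_for_presentation`, `weight`) are hypothetical, as are the sample items and speaker names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MediaItem:
    media_id: str
    speaker: str                # individual speaker the segment was scored against
    speaker_probability: float  # the "first probability" of claim 1


def first_ranking_values(items: List[MediaItem]) -> Dict[str, float]:
    """Assumed scheme: fraction of items whose probability this item meets or beats."""
    n = len(items)
    return {
        a.media_id: sum(a.speaker_probability >= b.speaker_probability for b in items) / n
        for a in items
    }


def second_ranking_value(query: str, speaker: str) -> float:
    """Assumed relevancy: share of the speaker's name tokens found in the query."""
    query_tokens = set(query.lower().split())
    name_tokens = speaker.lower().split()
    if not name_tokens:
        return 0.0
    return sum(t in query_tokens for t in name_tokens) / len(name_tokens)


def final_ranking_values(
    items: List[MediaItem], query: str, weight: float = 0.5
) -> Dict[str, float]:
    """Assumed combination: weighted average of the first and second ranking values."""
    first = first_ranking_values(items)
    return {
        it.media_id: weight * first[it.media_id]
        + (1.0 - weight) * second_ranking_value(query, it.speaker)
        for it in items
    }


def select_for_presentation(final: Dict[str, float], threshold: float) -> List[str]:
    """Keep items whose final value equals or exceeds the threshold, best first."""
    kept = [(mid, score) for mid, score in final.items() if score >= threshold]
    return [mid for mid, _ in sorted(kept, key=lambda p: p[1], reverse=True)]


# Hypothetical catalogue and query (placeholder names, not from the patent).
items = [
    MediaItem("clip-a", "Jane Doe", 0.9),
    MediaItem("clip-b", "Jane Doe", 0.4),
    MediaItem("clip-c", "John Roe", 0.7),
]
final = final_ranking_values(items, "jane doe interview")
print(select_for_presentation(final, threshold=0.5))  # ['clip-a', 'clip-b']
```

In this sample, `clip-c` has a higher speaker probability than `clip-b` but falls below the threshold because its speaker does not match the query, which shows why the final value depends on both ranking values.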

11. A system, comprising:
- at least one processor; and
  
  a memory storing executable instructions that, when executed by the at least one processor, causes the at least one processor to perform the following operations;
  
  extracting an audio track from an electronic media content;
  
  detecting, based on a speech model, a speaker segment within the extracted audio track;
  
  determining, by the processor, a first probability of the detected speaker segment being associated with an individual speaker by using both a speaker speech model and a non-speaker speech model, wherein the speaker speech model represents an individual speaker and the non-speaker speech model represents common characteristics from one or more speakers;
  
  determining a first ranking value of the electronic media content relative to other electronic media content based on the first probability of the detected speaker segment and probabilities for detected speaker segments within the other electronic media content;
  
  receiving a search query from a user;
  
  determining a second ranking value of the electronic media content based on relevancy between the query and the individual speaker; and
  
  determining a final ranking value of the electronic media content based on the first ranking value and the second ranking value.
- - 12. The system of claim 11, further comprising:
    - determining whether the final ranking value of the electronic media content equals to or exceeds a threshold; and
      
      presenting the electronic media content to the user if the final ranking value of the electronic media content equals to or exceeds the threshold.
  - 13. The system of claim 11, wherein detecting a speaker segment within the extracted audio track is based on a speech model and a non-speech model.
  - 14. The system of claim 11, further comprising:
    - presenting the electronic media content to the user based on the final ranking.
  - 15. The system of claim 11, further comprising:
    - determining, based on a detected face, a second probability of the detected speaker segment being associated with an individual speaker; and
      
      adjusting the first ranking value of the electronic media content based on the second probability of the detected speaker segment.
  - 16. The system of claim 11, further comprising:
    - generating at least one speaker speech model of an individual speaker, and a non-speaker speech model that includes common characteristics from one or more speakers.
  - 17. The system of claim 11, further comprising:
    - generating a plurality of speaker speech models for a subset of people, each speaker speech model corresponding to one person in the subset of people; and
      
      calculating a probability of the speaker segment involving one of the people in the subset of people, based on the plurality of speaker speech models.
  - 18. The system of claim 11, further comprising:
    - detecting, based on the speaker segments, duplicate videos among the electronic media content.
  - 19. The system of claim 11, further comprising:
    - extracting preview clips from the electronic media content; and
      
      presenting the extracted preview clips to the user.
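
The use of a speaker speech model together with a non-speaker speech model (claims 11 and 16), extended to several speaker models (claim 17), can be sketched as a log-likelihood-ratio score. The claims do not name a model family or a scoring formula, so everything here is an assumption made for illustration: diagonal-covariance Gaussians stand in for the speech models, a logistic function maps the ratio to a probability, and the names `avg_log_likelihood`, `speaker_probabilities`, `speaker_a`, and `speaker_b` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

# (means, variances) of a diagonal-covariance Gaussian; an assumed stand-in model.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def avg_log_likelihood(frames: Sequence[Sequence[float]], model: Gaussian) -> float:
    """Average per-frame log-likelihood of the feature frames under the model."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, mu, var in zip(frame, means, variances):
            total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total / len(frames)


def speaker_probabilities(
    frames: Sequence[Sequence[float]],
    speaker_models: Dict[str, Gaussian],
    non_speaker_model: Gaussian,
) -> Dict[str, float]:
    """Score the segment against each speaker model relative to the non-speaker model."""
    background = avg_log_likelihood(frames, non_speaker_model)
    probabilities = {}
    for name, model in speaker_models.items():
        llr = avg_log_likelihood(frames, model) - background
        llr = max(-50.0, min(50.0, llr))  # clamp so math.exp cannot overflow
        probabilities[name] = 1.0 / (1.0 + math.exp(-llr))  # assumed logistic mapping
    return probabilities


# Hypothetical 2-dimensional feature frames for one detected speaker segment.
segment = [[1.0, 2.0], [1.2, 1.8], [0.9, 2.1]]
models = {
    "speaker_a": ([1.0, 2.0], [0.1, 0.1]),
    "speaker_b": ([4.0, -1.0], [0.1, 0.1]),
}
non_speaker = ([2.0, 1.0], [4.0, 4.0])  # broad model of common characteristics

probs = speaker_probabilities(segment, models, non_speaker)
print(max(probs, key=probs.get))  # speaker_a
```

Scoring against the non-speaker model normalizes each speaker score, so a segment that fits every model poorly does not receive a high probability for any speaker.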

20. A tangible, non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform operations comprising:
- extracting an audio track from an electronic media content;
  
  detecting, based on a speech model, a speaker segment within the extracted audio track;
  
  determining, by the processor, a first probability of the detected speaker segment being associated with an individual speaker by using both a speaker speech model and a non-speaker speech model, wherein the speaker speech model represents an individual speaker and the non-speaker speech model represents common characteristics from one or more speakers;
  
  determining a first ranking value of the electronic media content relative to other electronic media content based on the first probability of the detected speaker segment and probabilities for detected speaker segments within the other electronic media content;
  
  receiving a search query from a user;
  
  determining a second ranking value of the electronic media content based on relevancy between the query and the individual speaker; and
  
  determining a final ranking value of the electronic media content based on the first ranking value and the second ranking value.

Current Assignee: Verizon Patent and Licensing Incorporated (Verizon Communications Inc.)
Original Assignee: Oath Inc. (Verizon Communications Inc.)
Inventors: Kocks, Peter F.; Hu, Guoning; Wu, Ping-Hao
Primary Examiner(s): BAKER, MATTHEW H
Application Number: US15/057,414
Publication Number: US 20160182957A1
Time in Patent Office: 875 Days

CPC Class Codes

G06F 16/433   using audio data

G06F 16/7834   using audio features

G06F 16/784   the detected or recognised ...

G10L 15/06   Creation of reference templ...

G10L 15/08   Speech classification or se...

G10L 17/00   Speaker identification or v...

G10L 25/57   for processing of video sig...

H04N 21/4394   involving operations for an...

H04N 21/4668   for recommending content, e...
