Systems and methods for transcribing videos using speaker identification

US 10,304,458 B1
Filed: 03/06/2015
Issued: 05/28/2019
Est. Priority Date: 03/06/2014
Status: Active Grant

Abstract

A system and method for summarizing video feeds correlated to identified speakers. A transcriber system includes multiple types of reasoning logic for identifying specific speakers contained in the video feed. Each type of reasoning logic is stored in memory and may be combined and configured to provide an aggregated speaker identification result useful for full or summarized transcription before transmission across a network for display on a network-accessible device.
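
As an illustration only, the aggregated speaker identification result described above could be sketched as a confidence-weighted vote across identification algorithms. The `Guess` type, the algorithm labels, the `threshold` cutoff, and the rule of summing confidences are all assumptions made for this example; the text here does not specify how the individual results are combined.

```python
from collections import defaultdict
from typing import Dict, NamedTuple, Optional, Sequence


class Guess(NamedTuple):
    """One identification algorithm's output for a single speech segment."""
    algorithm: str     # hypothetical label, e.g. "face", "voice", "ocr"
    speaker: str       # speaker name proposed by that algorithm
    confidence: float  # assumed to lie in [0, 1]


def aggregate(guesses: Sequence[Guess], threshold: float = 0.5) -> Optional[str]:
    """Return the speaker with the highest summed confidence, or None.

    Guesses below `threshold` are ignored (an assumed selection rule).
    """
    totals: Dict[str, float] = defaultdict(float)
    for guess in guesses:
        if guess.confidence >= threshold:
            totals[guess.speaker] += guess.confidence
    if not totals:
        return None
    return max(totals, key=lambda name: totals[name])


guesses = [
    Guess("face", "Anchor A", 0.9),
    Guess("voice", "Guest B", 0.6),
    Guess("ocr", "Guest B", 0.55),
    Guess("diarization", "Anchor A", 0.3),  # below the cutoff, so ignored
]
print(aggregate(guesses))
```

In this made-up input, two moderately confident algorithms that agree on one name outweigh a single highly confident algorithm that proposes another. A real system might prefer a trained model over a fixed sum.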

24 Claims

1. A method for transcribing video, comprising:
- receiving, by a computer-based system from a network, a video feed having video data, audio data, and closed-captioning data, the closed-captioning data indicative of speech defined by the audio data;
- identifying a speech segment within the closed-captioning data;
- automatically defining, by the computer-based system, a transcript of the speech based on the closed-captioning data in the video feed received from the network;
- automatically analyzing, by the computer-based system, the video data, audio data, and closed-captioning data by the computer-based system;
- automatically identifying, by the computer-based system, a speaker for the speech segment within the transcript based on the analyzing;
- automatically marking, by the computer-based system, the speech segment in the transcript with an identifier of the speaker thereby attributing the speech segment to the speaker;
- summarizing the transcript, wherein the summarizing comprises determining an overall percentage of speech attributable to the speaker for the closed-captioning data and selecting portions of the transcript for removal based on the determined overall percentage; and
- storing the transcript in memory.
- Dependent claims (2-12):
  - 2. The method of claim 1, wherein the analyzing comprises performing a voice recognition algorithm on the audio data.
  - 3. The method of claim 1, wherein the analyzing comprises performing a facial recognition algorithm on the video data.
  - 4. The method of claim 1, wherein the analyzing comprises:
    - identifying a face within the video data;
    - determining whether a mouth of the face is moving between video frames; and
    - correlating the identifier with the face based on the determining.
  - 5. The method of claim 1, wherein the analyzing comprises performing a speaker diarization algorithm on the audio data.
  - 6. The method of claim 1, wherein the analyzing comprises performing an optical character recognition (OCR) algorithm on the video data.
  - 7. The method of claim 1, wherein the analyzing comprises performing a plurality of identification algorithms for identifying the speaker, and wherein the method comprises:
    - automatically assigning, by the computer-based system, a confidence value to each of the identification algorithms; and
    - automatically selecting, by the computer-based system, at least one of the identification algorithms for use in the identifying the speaker based on the assigned confidence values.
  - 8. The method of claim 1, further comprising automatically identifying, by the computer-based system, at least one commercial in the video feed, wherein the defining is based on the identifying the at least one commercial.
  - 9. The method of claim 1, wherein the analyzing further comprises performing a plurality of identification algorithms for identifying the speaker, wherein the analyzing further comprises performing a machine-learning algorithm for use in identifying the speaker, and wherein a result for each of the plurality of identification algorithms is used in performing the machine-learning algorithm.
  - 10. The method of claim 9, further comprising automatically assigning, by the computer-based system, a confidence value to each of the identification algorithms for use in identifying the speaker.
  - 11. The method of claim 1, further comprising reading a name of a person from the video data using optical character recognition, wherein the identifying the speaker comprises identifying the person as the speaker based on the reading.
  - 12. The method of claim 1, wherein the selecting is performed such that an overall percentage of speech attributable to the speaker for the transcript is kept within a predefined margin of the overall percentage of speech attributable to the speaker for the closed-captioning data.
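
The summarizing step of claims 1 and 12 can be illustrated with a small sketch. It measures each speaker's share of speech by word count, then removes segments greedily, longest first, as long as every speaker's share stays within a margin of the original share. Word count as the measure of speech, the greedy order, and the names `speaker_shares`, `summarize`, `keep_ratio`, and `margin` are assumptions for this example, not details taken from the claims.

```python
from collections import Counter
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (speaker identifier, caption text)


def speaker_shares(segments: List[Segment]) -> Dict[str, float]:
    """Fraction of all words attributable to each speaker."""
    words: Counter = Counter()
    for speaker, text in segments:
        words[speaker] += len(text.split())
    total = sum(words.values())
    return {s: n / total for s, n in words.items()} if total else {}


def summarize(segments: List[Segment], keep_ratio: float = 0.5,
              margin: float = 0.1) -> List[Segment]:
    """Drop segments while keeping each speaker's share within `margin`."""
    original = speaker_shares(segments)
    lengths = [len(text.split()) for _, text in segments]
    target = keep_ratio * sum(lengths)
    kept = set(range(len(segments)))
    # Consider the longest segments first (an assumed heuristic).
    order = sorted(range(len(segments)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        if sum(lengths[j] for j in kept) <= target:
            break  # summary is already short enough
        trial = kept - {i}
        shares = speaker_shares([segments[j] for j in sorted(trial)])
        if all(abs(shares.get(s, 0.0) - p) <= margin for s, p in original.items()):
            kept = trial  # removal keeps every speaker near the original share
    return [segments[j] for j in sorted(kept)]


segs = [
    ("A", "one two three four"),
    ("B", "one two three four"),
    ("A", "one two"),
    ("B", "one two"),
]
summary = summarize(segs, keep_ratio=0.5, margin=0.2)
print(summary, speaker_shares(summary))
```

Because a removal is accepted only when the margin check passes, the greedy pass may stop above the requested length. That trade-off favors preserving the speaker balance over hitting an exact size.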

13. A transcriber system, comprising:
- a network interface for receiving a video feed from a network, the video feed having video data, audio data, and closed-captioning data, wherein the closed-captioning data is indicative of speech defined by the audio data;
- memory; and
- logic configured to define a transcript of the speech based on the closed-captioning data in the video feed received by the network interface, the logic configured to identify a speech segment within the closed-captioning data, the logic further configured to analyze the video feed and to identify a speaker for the speech segment within the transcript, wherein the logic is configured to mark the speech segment in the transcript with an identifier of the speaker thereby attributing the speech segment to the speaker, wherein the logic is configured to determine an overall percentage of speech attributable to the speaker for the closed-captioning data and to summarize the transcript by selecting portions of the transcript for removal based on the determined overall percentage, and wherein the logic is configured to store the transcript in the memory.
- Dependent claims (14-24):
  - 14. The system of claim 13, wherein the logic is configured to perform a voice recognition algorithm on the audio data, and wherein the logic is configured to identify the speaker based on the voice recognition algorithm.
  - 15. The system of claim 13, wherein the logic is configured to perform a facial recognition algorithm on the video data, and wherein the logic is configured to identify the speaker based on the facial recognition algorithm.
  - 16. The system of claim 13, wherein the logic is configured to identify a face within the video data and to make a determination whether a mouth of the face is moving between video frames, and wherein the logic is configured to identify the speaker based on the determination.
  - 17. The system of claim 13, wherein the logic is configured to perform a speaker diarization algorithm on the audio data, and wherein the logic is configured to identify the speaker based on the speaker diarization algorithm.
  - 18. The system of claim 13, wherein the logic is configured to perform an optical character recognition (OCR) algorithm on the video data, and wherein the logic is configured to identify the speaker based on the OCR algorithm.
  - 19. The system of claim 13, wherein the logic is configured to perform a plurality of identification algorithms for identifying the speaker, wherein the logic is configured to assign a confidence value to each of the identification algorithms, and wherein the logic is configured to select at least one of the identification algorithms for use in the identifying the speaker based on the assigned confidence values.
  - 20. The system of claim 13, wherein the logic is configured to identify at least one commercial in the video feed and to define the transcript based on the identified commercial such that the transcript omits at least one speech segment associated with the identified commercial.
  - 21. The system of claim 13, wherein the logic is configured to perform a plurality of identification algorithms for identifying the speaker, and wherein the logic is configured to use a result for each of the plurality of identification algorithms in a machine-learning algorithm for identifying the speaker of the speech segment.
  - 22. The system of claim 21, wherein the logic is configured to assign a confidence value to each of the identification algorithms for use by the machine-learning algorithm in identifying the speaker of the speech segment.
  - 23. The system of claim 13, wherein the logic is configured to determine a speaker transition between consecutive speech segments in the transcript and to insert a speaker name into the transcript identifying the speaker of one of the consecutive speech segments.
  - 24. The system of claim 13, wherein the logic is configured to select the portions of the transcript for removal such that an overall percentage of speech attributable to the speaker for the transcript is kept within a predefined margin of the overall percentage of speech attributable to the speaker for the closed-captioning data.
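
The speaker-transition behavior of claim 23 can be pictured with a short sketch: walk the attributed segments in order and insert a speaker name only where the speaker changes between consecutive segments. The function name `render_transcript`, the `(speaker, text)` tuple layout, and the `Name: text` output format are hypothetical choices for this example.

```python
from typing import List, Tuple


def render_transcript(segments: List[Tuple[str, str]]) -> str:
    """Prefix a segment with its speaker name only at a speaker transition."""
    lines = []
    previous = None  # no speaker seen yet, so the first segment is labeled
    for speaker, text in segments:
        if speaker != previous:
            lines.append(f"{speaker}: {text}")  # transition: insert the name
        else:
            lines.append(text)                  # same speaker continues
        previous = speaker
    return "\n".join(lines)


demo = [
    ("Anchor", "Good evening."),
    ("Anchor", "Our top story."),
    ("Guest", "Thanks for having me."),
]
print(render_transcript(demo))
```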

Current Assignee: The Board Of Trustees Of The University Of Alabama And The University Of Alabama In Huntsville
Original Assignee: The Board Of Trustees Of The University Of Alabama And The University Of Alabama In Huntsville
Inventors: Woo, Daniel Newton
Primary Examiner(s): Wozniak, James
Application Number: US14/641,205
Time in Patent Office: 1,544 Days
Field of Search: 704/235, 704/246, 348/468, 382/118
CPC Class Codes

G10L 15/04   Segmentation; Word boundary...

G10L 15/25   using position of the lips,...

G10L 15/26   Speech to text systems G10L...

G10L 17/00   Speaker identification or v...

G10L 17/02   Preprocessing operations, e...

G10L 17/06   Decision making techniques;...

G10L 17/10   Multimodal systems, i.e. ba...

H04N 21/435   Processing of additional da...

H04N 21/4394   involving operations for an...

H04N 21/44008   involving operations for an...

H04N 21/440236   by media transcoding, e.g. ...

H04N 21/4884   for displaying subtitles

H04N 21/4888   for displaying teletext cha...

H04N 21/812   involving advertisement dat...
