Cognitive print speaker modeler

US 10,621,990 B2
Filed: 04/30/2018
Issued: 04/14/2020
Est. Priority Date: 04/30/2018
Status: Active Grant

Abstract

Aspects of the present invention provide devices that subtitle streaming video with audio and identify a speaker in a streaming video with audio according to words spoken by the speaker matched to a cognitive print. The cognitive print includes traits classified according to a hierarchical long short term memory (LSTM) model. The hierarchical LSTM model includes layers of LSTMs, and each layer corresponds to the classification of one trait. A processor annotates a subtitle of the words spoken by the speaker, which decorates the subtitle with a label representative of the identified speaker, and streams the decorated subtitle with the streaming video with audio.
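As a rough illustration of the layered structure the abstract describes, the following Python sketch stacks LSTM layers and attaches one classification head per trait, so that each layer yields the label for one trait. It is a toy with untrained random weights, not the patented implementation; the class names `LSTMLayer` and `HierarchicalTraitModel`, the feature frames, and the trait labels are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMLayer:
    """One LSTM layer with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, candidate, output.
        self.w = {g: [[rng.uniform(-0.5, 0.5) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifgo"}
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}

    def _gate(self, name, z, activation):
        return [activation(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(self.w[name], self.b[name])]

    def run(self, sequence):
        """Return the hidden state produced at every time step."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in sequence:
            z = list(x) + h
            i = self._gate("i", z, sigmoid)
            f = self._gate("f", z, sigmoid)
            g = self._gate("g", z, math.tanh)
            o = self._gate("o", z, sigmoid)
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs


class HierarchicalTraitModel:
    """Stack of LSTM layers; each layer classifies exactly one trait."""

    def __init__(self, input_size, hidden_size, trait_labels, seed=0):
        rng = random.Random(seed)
        self.layers = []
        size = input_size
        for trait, labels in trait_labels.items():
            layer = LSTMLayer(size, hidden_size, rng)
            # Linear head: one weight row per candidate label (assumption).
            head = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                    for _ in labels]
            self.layers.append((trait, labels, layer, head))
            size = hidden_size  # the next layer consumes this layer's output

    def classify(self, sequence):
        """Return a {trait: label} mapping, i.e. a toy cognitive print."""
        result = {}
        for trait, labels, layer, head in self.layers:
            sequence = layer.run(sequence)
            last = sequence[-1]
            scores = [sum(w * v for w, v in zip(row, last)) for row in head]
            result[trait] = labels[scores.index(max(scores))]
        return result


# Hypothetical trait labels and feature frames, for illustration only.
TRAITS = {"tone": ["calm", "excited"], "pitch": ["low", "high"],
          "accent": ["accent_1", "accent_2"]}
model = HierarchicalTraitModel(input_size=3, hidden_size=4, trait_labels=TRAITS)
frames = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
cognitive_print = model.classify(frames)
```

Because the weights are random, the labels in `cognitive_print` carry no meaning; the sketch only shows how a per-layer, per-trait arrangement could be wired.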

20 Claims

1. A computer-implemented method for subtitling of streaming video with audio, comprising executing on a computer processor:
- identifying a speaker in a streaming video with audio according to words spoken by the speaker matched to a cognitive print, wherein the cognitive print comprises a plurality of traits classified according to a hierarchical long short term memory (LSTM) model, wherein the hierarchical LSTM model comprises a plurality of layers of LSTMs and each layer corresponds to the classification of one trait of the plurality of traits;
  
  annotating a subtitle of the words spoken by the speaker, which decorates the subtitle with a label representative of the identified speaker; and
  
  adding the decorated subtitle to the streaming video with audio.
- - 2. The method of claim 1, wherein identifying the speaker in the streaming video with audio comprises:
    - distinguishing a plurality of speakers in the streaming video with audio, each according to corresponding words spoken by the speaker and a corresponding cognitive print, wherein the corresponding cognitive print differs for each speaker by at least one trait of the plurality of traits.
  - 3. The method of claim 1, wherein the plurality of traits comprises a trait selected from the group consisting of tone, stress, pitch, sentiment, social propensity, prosody, and accent.
  - 4. The method of claim 1, further comprising:
    - identifying cues in the streaming video with audio, wherein the cues comprise indicators of a time and a location;
      
      determining a liveness indicator for the streaming video with audio which indicates whether the streaming video with audio is streamed live by comparing the cues with external electronic sources; and
      
      annotating the subtitle of the words spoken by the speaker, which includes the liveness indicator in the decorated subtitle.
  - 5. The method of claim 4, further comprising:
    - selecting an automatic speech recognition algorithm from a plurality of automatic speech recognition algorithms according to the liveness indicator and a policy; and
      
      annotating the subtitle of the words spoken by the speaker, which decorates the subtitle of words spoken by the identified speaker with the text of spoken words according to the selected automatic speech recognition algorithm.
  - 6. The method of claim 4, wherein the cues in the streaming video comprise a cue selected from a group consisting of:
    - objects indicative of weather shown in video, noises indicative of weather in audio, objects identifying an event, objects identifying a location, facial recognition of persons present in video, statements of an event in audio, statements of a location in audio, statements of a time in audio, objects indicative of time present in video, objects indicative of speaker health in video, and audio indicative of speaker health; and
      
      wherein external electronic sources comprise a source selected from a group consisting of:
      
      an electronic source of weather reported for a geographic location at current time, an electronic source of an event calendar, a known live electronic video feed from a geographic location, a known live electronic audio feed for a geographic location, an electronic source of health records for a speaker, an electronic source of reported news, an electronic source of a person's itinerary, and an electronic source of social media.
  - 7. The method of claim 1, further comprising:
    - integrating computer-readable program code into a computer system comprising a processor, a computer readable memory in circuit communication with the processor, and a computer readable storage medium in circuit communication with the processor; and
      
      wherein the processor executes program code instructions stored on the computer-readable storage medium via the computer readable memory and thereby performs the identifying a speaker in a streaming video with audio according to words spoken by the speaker matched to a cognitive print.
  - 8. The method of claim 7, wherein the computer-readable program code is provided as a service in a cloud environment.
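The matching and decorating steps of claims 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed method itself: a cognitive print is modeled as a plain dictionary of trait labels, matching is a simple count of agreeing traits, and the function names and speaker labels are hypothetical.

```python
def prints_are_distinct(enrolled):
    """True when every pair of enrolled prints differs in at least one trait."""
    names = list(enrolled)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if enrolled[first] == enrolled[second]:
                return False
    return True


def match_speaker(observed, enrolled):
    """Return the enrolled speaker whose print agrees with the most observed traits."""
    def agreement(name):
        return sum(1 for trait, value in observed.items()
                   if enrolled[name].get(trait) == value)
    return max(enrolled, key=agreement)


def decorate_subtitle(text, speaker):
    """Prefix the subtitle text with a label for the identified speaker."""
    return f"[{speaker}] {text}"


# Hypothetical enrolled cognitive prints.
enrolled = {
    "Speaker A": {"tone": "calm", "pitch": "low", "accent": "accent_1"},
    "Speaker B": {"tone": "calm", "pitch": "high", "accent": "accent_2"},
}
observed = {"tone": "calm", "pitch": "high", "accent": "accent_2"}

speaker = match_speaker(observed, enrolled)
subtitle = decorate_subtitle("Good evening.", speaker)
```

A count of agreeing traits is only one possible matching rule; the claims do not specify how the match is scored.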

9. A system for subtitling streaming video with audio, comprising:
- a processor;
  
  a computer readable memory in circuit communication with the processor; and
  
  a computer readable storage medium in circuit communication with the processor;
  
  wherein the processor executes program instructions stored on the computer-readable storage medium via the computer readable memory and thereby:
  
  identify a speaker in a streaming video with audio according to words spoken by the speaker matched to a cognitive print, wherein the cognitive print comprises a plurality of traits classified according to a hierarchical long short term memory (LSTM) model, wherein the hierarchical LSTM model comprises a plurality of layers of LSTMs and each layer of LSTMs corresponds to the classification of one trait of the plurality of traits;
  
  annotate a subtitle of the words spoken by the speaker, which decorates the subtitle with a label representative of the identified speaker; and
  
  stream the decorated subtitle with the streaming video with audio.
- - 10. The system of claim 9, wherein the processor executes program instructions stored on the computer-readable storage medium via the computer readable memory and thereby:
    - distinguish a plurality of speakers in the streaming video with audio, each according to corresponding words spoken by the speaker and a corresponding cognitive print, wherein the corresponding cognitive print differs for each speaker by at least one trait of the plurality of traits.
  - 11. The system of claim 9, wherein the plurality of traits comprises a trait selected from a group consisting of tone, stress, pitch, sentiment, social propensity, prosody, and accent.
  - 12. The system of claim 9, wherein the processor executes program instructions stored on the computer-readable storage medium via the computer readable memory and thereby:
    - identify cues in the streaming video with audio, wherein the cues comprise indicators of a time and a location;
      
      determine a liveness indicator for the streamed video with audio which indicates whether the streaming video with audio is streamed live by comparing the cues with external electronic sources; and
      
      annotate the subtitle of the words spoken by the speaker, which includes the liveness indicator in the decorated subtitle.
  - 13. The system of claim 12, wherein the processor executes program instructions stored on the computer-readable storage medium via the computer readable memory and thereby:
    - select an automatic speech recognition algorithm from a plurality of automatic speech recognition algorithms according to the liveness indicator and a policy; and
      
      annotate the subtitle of the words spoken by the speaker, which decorates the subtitle of words spoken by the identified speaker with the text of spoken words according to the selected automatic speech recognition algorithm.
  - 14. The system of claim 13, wherein the cues in the streaming video comprise a cue selected from a group consisting of:
    - objects indicative of weather shown in video, noises indicative of weather in audio, objects identifying an event, objects identifying a location, facial recognition of persons present in video, statements of an event in audio, statements of a location in audio, statements of a time in audio, objects indicative of time present in video, objects indicative of speaker health in video, and audio indicative of speaker health; and
      
      wherein external electronic sources comprise a source selected from a group consisting of:
      
      an electronic source of weather reported for a geographic location at current time, an electronic source of an event calendar, a known live electronic video feed from a geographic location, a known live electronic audio feed for a geographic location, an electronic source of health records for a speaker, an electronic source of reported news, an electronic source of a person's itinerary, and an electronic source of social media.
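The liveness determination recited in claim 12 compares cues found in the stream with external electronic sources. The Python sketch below shows one way such a comparison might look, assuming only two cues (a stated time and observed weather at a stated location) and a dictionary standing in for an external weather source. The function names, the ten-minute tolerance, and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def liveness_indicator(cues, reported_weather, now,
                       tolerance=timedelta(minutes=10)):
    """Return True when the time and weather cues agree with external sources."""
    time_matches = abs(now - cues["time"]) <= tolerance
    # A location missing from the external source never counts as a match.
    weather_matches = reported_weather.get(cues["location"]) == cues["weather"]
    return time_matches and weather_matches


def add_liveness(subtitle, live):
    """Append the liveness indicator to an already decorated subtitle."""
    return f"{subtitle} ({'live' if live else 'not live'})"


# Hypothetical cues extracted from the stream and a stand-in weather source.
now = datetime(2018, 4, 30, 12, 0)
reported_weather = {"Austin": "rain"}
fresh_cues = {"time": datetime(2018, 4, 30, 11, 58),
              "location": "Austin", "weather": "rain"}
stale_cues = {"time": datetime(2018, 4, 29, 11, 58),
              "location": "Austin", "weather": "rain"}

live = liveness_indicator(fresh_cues, reported_weather, now)
replayed = liveness_indicator(stale_cues, reported_weather, now)
decorated = add_liveness("[Speaker B] Good evening.", live)
```

Requiring every cue to agree is a deliberately strict choice for the sketch; a real system could weigh cues differently.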

15. A computer program product for subtitling streaming video with audio, the computer program product comprising:
- a computer readable storage medium having computer readable program code embodied therewith, wherein the computer readable storage medium is not a transitory signal per se, the computer readable program code comprising instructions for execution by a processor that causes the processor to:
  
  identify a speaker in a streaming video with audio according to words spoken by the speaker matched to a cognitive print, wherein the cognitive print comprises a plurality of traits classified according to a hierarchical long short term memory (LSTM) model, wherein the hierarchical LSTM model comprises a plurality of layers of LSTMs and each layer corresponds to the classification of one trait of the plurality of traits;
  
  annotate a subtitle of the words spoken by the speaker, which decorates the subtitle with a label representative of the identified speaker; and
  
  stream the decorated subtitle with the streaming video with audio.
- - 16. The computer program product of claim 15, wherein the instructions for execution cause the processor to:
    - distinguish a plurality of speakers in the streaming video with audio, each according to corresponding words spoken by the speaker and a corresponding cognitive print, wherein the corresponding cognitive print differs for each speaker by at least one trait of the plurality of traits.
  - 17. The computer program product of claim 15, wherein the plurality of traits comprises a trait selected from a group consisting of tone, stress, pitch, sentiment, social propensity, prosody, and accent.
  - 18. The computer program product of claim 15, wherein the instructions for execution cause the processor to:
    - identify cues in the streaming video with audio, wherein the cues comprise indicators of a time and a location;
      
      determine a liveness indicator for the streaming video with audio which indicates whether the streaming video with audio is streamed live by comparing the cues with external electronic sources; and
      
      annotate the subtitle of the words spoken by the speaker, which includes the liveness indicator in the decorated subtitle.
  - 19. The computer program product of claim 18, wherein the instructions for execution cause the processor to:
    - select an automatic speech recognition algorithm from a plurality of automatic speech recognition algorithms according to the liveness indicator and a policy; and
      
      annotate the subtitle of the words spoken by the speaker, which decorates the subtitle of words spoken by the identified speaker with the text of spoken words according to the selected automatic speech recognition algorithm.
  - 20. The computer program product of claim 18, wherein the instructions for execution cause the processor to:
    - wherein the cues in the streamed video comprise a cue selected from a group consisting of:
      
      objects indicative of weather shown in video, noises indicative of weather in audio, objects identifying an event, objects identifying a location, facial recognition of persons present in video, statements of an event in audio, statements of a location in audio, statements of a time in audio, objects indicative of time present in video, objects indicative of speaker health in video, and audio indicative of speaker health; and
      
      wherein external electronic sources comprise a source selected from a group consisting of:
      
      an electronic source of weather reported for a geographic location at current time, an electronic source of an event calendar, a known live electronic video feed from a geographic location, a known live electronic audio feed for a geographic location, an electronic source of health records for a speaker, an electronic source of reported news, an electronic source of a person's itinerary, and an electronic source of social media.
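Claims 5, 13, and 19 recite selecting an automatic speech recognition (ASR) algorithm according to the liveness indicator and a policy. A minimal Python sketch of that selection follows. The two ASR functions are placeholders that return precomputed transcripts rather than performing recognition, and the policy shown (a low-latency algorithm for live streams, a higher-accuracy one otherwise) is an assumption; the claims do not say what the policy contains.

```python
def low_latency_asr(segment):
    """Placeholder for a fast recognizer; returns a precomputed draft."""
    return segment["draft_transcript"]


def high_accuracy_asr(segment):
    """Placeholder for a slower, more accurate recognizer."""
    return segment["final_transcript"]


# Hypothetical registry of available algorithms.
ALGORITHMS = {"low_latency": low_latency_asr,
              "high_accuracy": high_accuracy_asr}

# Hypothetical policy: maps the liveness indicator to an algorithm name.
POLICY = {True: "low_latency", False: "high_accuracy"}


def select_asr(live, policy=POLICY, algorithms=ALGORITHMS):
    """Pick the ASR algorithm that the policy assigns to this liveness value."""
    return algorithms[policy[live]]


def subtitle_for(segment, speaker, live):
    """Decorate the speaker label with text from the selected algorithm."""
    asr = select_asr(live)
    return f"[{speaker}] {asr(segment)}"


segment = {"draft_transcript": "good evning",
           "final_transcript": "Good evening."}
live_subtitle = subtitle_for(segment, "Speaker B", live=True)
recorded_subtitle = subtitle_for(segment, "Speaker B", live=False)
```

Keeping the policy as data rather than code means it could be changed without touching the selection logic.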

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Amsterdam, Jeff; Baughman, Aaron K.; Hammer, Stephen C.; Provan, David A.
Primary Examiner(s)
Colucci, Michael C

Application Number

US15/966,122
Publication Number

US 20190333520A1
Time in Patent Office

715 Days
Field of Search

704/235, 704/233, 704/232, 704/500, 704/9, 704/234
US Class Current
CPC Class Codes

G06F 16/7834   using audio features

G06F 16/784   the detected or recognised ...

G06F 18/24323   Tree-organised classifiers

G06V 10/764   using classification, e.g. ...

G06V 20/44   Event detection

G06V 40/10   Human or animal bodies, e.g...

G10L 17/00   Speaker identification or v...

G10L 17/04   Training, enrolment or mode...

G10L 17/06   Decision making techniques;...

G10L 17/18   Artificial neural networks;...

G10L 17/26   Recognition of special voic...
