Text transcript generation from a communication session

US 9,443,518 B1
Filed: 08/30/2012
Issued: 09/13/2016
Est. Priority Date: 08/31/2011
Status: Active Grant

Abstract

Techniques, systems, and devices for managing streaming media among end user devices in a video conferencing system are described. For example, a transcript may be automatically generated for a video conference. In one example, a method may include receiving a combined media stream comprising a plurality of media sub-streams each associated with one of a plurality of end user devices, wherein each of the plurality of media sub-streams comprises a respective video component and a respective audio component. The method may also include, for each of the media sub-streams, separating the audio component from the respective video component; for each audio component of the respective media sub-streams, transcribing speech from the audio component to text for the respective media sub-stream; and combining the text for each of the respective media sub-streams into a combined transcription. In some examples, the combined transcription may also be translated into a user-selected language.
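
The flow described in the abstract can be pictured with a small Python sketch. It is illustrative only and is not taken from the patent: the `SubStream` type, the `transcribe_session` function, and the `fake_stt` recognizer are hypothetical names, and the audio and video payloads are placeholder bytes rather than real media.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubStream:
    """One participant's media. Both payloads are placeholders (assumption)."""
    device_id: str
    audio: bytes
    video: bytes


def transcribe_session(
    combined: List[SubStream],
    speech_to_text: Callable[[bytes], str],
) -> str:
    """Transcribe each sub-stream's audio separately, then combine the text."""
    per_device: Dict[str, str] = {}
    for sub in combined:
        audio_only = sub.audio  # the video component is simply set aside
        per_device[sub.device_id] = speech_to_text(audio_only)
    # Label each line with the device it came from.
    return "\n".join(f"{dev}: {text}" for dev, text in per_device.items())


def fake_stt(audio: bytes) -> str:
    """Stand-in recognizer (assumption): treats the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")


session = [
    SubStream("device-A", b"hello everyone", b"<video>"),
    SubStream("device-B", b"hi there", b"<video>"),
]
transcript = transcribe_session(session, fake_stt)
print(transcript)
```

Passing the recognizer in as a callable keeps the sketch independent of any particular speech-to-text engine, which the abstract does not name.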

19 Claims

1. A method for transcribing speech from a real-time communication session, the method comprising:
- receiving, by one or more processors, a combined media stream comprising a plurality of media sub-streams each associated with one of a plurality of end user devices, wherein each of the plurality of media sub-streams comprises a respective video component and a respective audio component, wherein each of the plurality of media sub-streams in the combined media stream is separable from others of the plurality of media sub-streams in the combined media stream;
  
  separating, by the one or more processors, each of the media sub-streams in the combined media stream from the combined media stream;
  
  for each of the separated media sub-streams, separating, by the one or more processors, the respective audio component from the respective video component;
  
  for each separated audio component of the respective media sub-streams associated with one of the plurality of end user devices:
  
  identifying one or more periods of non-speech based on an amplitude of an audio signal, each period of non-speech indicative of a break between phrases of speech;
  
  generating a plurality of portions of audio based on the identified periods of non-speech; and
  
  separately transcribing, by the one or more processors, speech from the plurality of portions of audio to text for the respective media sub-stream;
  
  combining, by the one or more processors, the separately transcribed text for each of the respective media sub-streams into a combined transcription; and
  
  annotating the text to include additional content by determining one or more keywords of the text and selecting, based on the one or more keywords, one or more advertisements.
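
Claim 1 recites identifying periods of non-speech from the amplitude of the audio signal and cutting the audio into portions at those breaks. A minimal sketch of one way to do that follows. The threshold of 0.05, the minimum run of 3 quiet samples, and the function name `split_on_silence` are assumptions chosen for illustration; the claim does not specify them.

```python
from typing import List, Tuple


def split_on_silence(
    samples: List[float],
    threshold: float = 0.05,   # assumed amplitude floor for speech
    min_silence: int = 3,      # assumed number of quiet samples that marks a break
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of speech portions, end exclusive.

    A run of at least `min_silence` consecutive samples whose absolute
    amplitude is below `threshold` is treated as a break between phrases.
    """
    portions = []
    start = None  # index where the current speech portion began
    quiet = 0     # length of the current run of quiet samples
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                # Close the portion just before the quiet run began.
                portions.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # Audio ended mid-portion; drop any trailing quiet samples.
        portions.append((start, len(samples) - quiet))
    return portions


samples = [0.0, 0.4, 0.5, 0.0, 0.0, 0.0, 0.3, 0.2, 0.0]
portions = split_on_silence(samples)
print(portions)  # [(1, 3), (6, 8)]
```

Each returned pair can then be handed to a recognizer on its own, which matches the claim's step of separately transcribing the portions. Pauses shorter than `min_silence` stay inside a portion, so brief hesitations do not split a phrase.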

2. A method for transcribing speech in a communication session comprising:
- receiving, by one or more processors, a combined media stream comprising a plurality of media sub-streams each associated with one of a plurality of end user devices, wherein each of the plurality of media sub-streams in the combined media stream comprises a respective video component and a respective audio component, wherein each of the plurality of media sub-streams in the combined media stream is separable from others of the plurality of media sub-streams in the combined media stream;
  
  separating each of the media sub-streams in the combined media stream from the combined media stream;
  
  for each of the separated media sub-streams, separating, by the one or more processors, the respective audio component from the respective video component;
  
  for each separate audio component of each of the respective media sub-streams associated with one of the plurality of end user devices, separately transcribing, by the one or more processors, speech from the audio component to text for the respective media sub-stream;
  
  combining, by the one or more processors, the separately transcribed text for each of the respective media sub-streams into a combined transcription that identifies text associated with each end user device based on the respective sub-stream; and
  
  annotating the text for the audio component of each respective media sub-stream to include additional content, wherein annotating the text comprises:
  
  determining one or more keywords of the text;
  
  selecting, based on the one or more keywords, one or more hyperlinks; and
  
  inserting at least one of the one or more hyperlinks into the text.
  - 3. The method of claim 2, further comprising:
    - identifying, for each audio component of the respective media sub-streams, one or more periods of non-speech based on a continuous signal value for a predetermined amount of time, each period indicative of a break between phrases of speech; and
      
      generating, for each of the respective media sub-streams, a plurality of portions of audio based on the identified periods of non-speech, wherein transcribing speech from the audio component comprises transcribing the plurality of portions of audio.
  - 4. The method of claim 3, further comprising, for each of the media sub-streams, associating one or more time tags with the respective portions of the text, wherein each of the one or more time tags indicate when the respective portions of the text occurred within the media stream, and wherein combining the text for each of the respective media sub-streams into the combined transcription further comprises combining the text for each of the respective media sub-streams into the combined transcription based on the time tags associated with each respective portion of the text.
  - 5. The method of claim 4, wherein a beginning and end of sentences are marked with the time tags.
  - 6. The method of claim 2, further comprising:
    - annotating the text for the audio component of each respective media sub-stream to include first additional content for a first user device to produce first separately transcribed text;
      
      combining, by the one or more processors, the first separately transcribed text for each of the respective media sub-streams into a first combined transcription that identifies text associated with each end user device based on the respective sub-stream;
      
      annotating the text for the audio component of each respective media sub-stream to include second additional content for a second user device to produce second separately transcribed text, wherein the second separately transcribed text is different from the first separately transcribed text;
      
      combining, by the one or more processors, the second separately transcribed text for each of the respective media sub-streams into a second combined transcription that identifies text associated with each end user device based on the respective sub-stream;
      
      outputting, for display at the first user device, the first combined transcription; and
      
      outputting, for display at the second user device, the second combined transcription.
  - 7. The method of claim 2, wherein annotating the text further comprises selecting, based on the one or more keywords, inserted web elements.
  - 8. The method of claim 2, wherein annotating the text to include additional content further comprises selecting, based on the one or more keywords, one or more advertisements.
  - 9. The method of claim 2, further comprising:
    - receiving an indication of a selected language from a user associated with one of the plurality of end user devices;
      
      translating the combined transcription into the selected language; and
      
      providing the translation of the combined transcription for purposes of display at the one of the plurality of end user devices associated with the user.
  - 10. The method of claim 2, further comprising outputting, for display at one or more of the plurality of end user devices, the combined transcription.
  - 11. The method of claim 2, wherein the plurality of media sub-streams are generated during a real-time communication session, and wherein the combined transcription is representative of at least a portion of speech during the real-time communication session.
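
Claims 4 and 5 describe associating time tags with portions of text and combining the per-device text based on those tags. The sketch below shows one possible reading, assuming each tag is a start and end time in seconds from the beginning of the session. The `Segment` type, the `combine_by_time_tags` function, and the output line format are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    start: float  # time tag marking the beginning, in seconds (assumed unit)
    end: float    # time tag marking the end
    text: str


def combine_by_time_tags(per_device: Dict[str, List[Segment]]) -> List[str]:
    """Interleave every device's segments in order of their start time tag."""
    tagged = [
        (seg.start, device, seg.text)
        for device, segs in per_device.items()
        for seg in segs
    ]
    # Ties on start time are broken by device ID so the order is deterministic.
    tagged.sort(key=lambda t: (t[0], t[1]))
    return [f"[{start:.1f}s] {device}: {text}" for start, device, text in tagged]


per_device = {
    "device-A": [
        Segment(0.0, 1.8, "Shall we start?"),
        Segment(4.1, 6.0, "Great."),
    ],
    "device-B": [Segment(2.0, 3.9, "Yes, I'm ready.")],
}
combined = combine_by_time_tags(per_device)
for line in combined:
    print(line)
```

Because each line keeps its device ID, the combined transcription still identifies which end user device the text came from, as claim 2 requires.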

12. A server device comprising:
- one or more processors configured to:
  
  receive by a communication server a combined media stream comprising a plurality of media sub-streams each associated with one of a plurality of end user devices, wherein each of the plurality of media sub-streams in the combined media stream comprises a respective video component and a respective audio component, wherein each of the plurality of media sub-streams in the combined media stream is separable from others of the plurality of media sub-streams in the combined media stream;
  
  separate, by the communication server, each of the media sub-streams in the combined media stream from the combined media stream;
  
  for each of the separated media sub-streams, separate by the communication server the respective audio component from the respective video component;
  
  for each audio component of each of the respective media sub-streams, separately transcribe by a transcription module, using one or more speech-to-text units, speech from the audio component to text for the respective media sub-stream;
  
  combine by an annotation module the text for each of the respective media sub-streams into a combined transcription that identifies text associated with each end user device based on the respective sub-stream;
  
  annotate the text for the audio component of each respective media sub-stream to include additional content, wherein annotating the text comprises:
  
  determining one or more keywords of the text;
  
  selecting, based on the one or more keywords, one or more hyperlinks; and
  
  inserting at least one of the one or more hyperlinks into the text.
  - 13. The server device of claim 12, wherein the one or more processors are further configured to:
    - associate one or more time tags with respective portions of the text, wherein each of the one or more time tags indicate when respective portions of the text occurred within the media stream; and
      
      combine the text for each of the respective media sub-streams into the combined transcription based on the time tags associated with each respective portions of the text.
  - 14. The server device of claim 13, wherein the one or more processors are configured to arrange respective portions of the text substantially chronologically within the combined transcription according to the time tags.
  - 15. The server device of claim 13, wherein a beginning and end of sentences are marked with time tags.
  - 16. The server device of claim 15, wherein the one or more processors are further configured to:
    - determine one or more keywords of the text; and
      
      select, based on the one or more keywords, inserted web elements for inclusion in the text as the additional content.
  - 17. The server device of claim 12, wherein the one or more processors are further configured to:
    - receive an indication of a selected language from a user associated with one of the plurality of end user devices;
      
      translate the combined transcription into the selected language; and
      
      provide the translation of the combined transcription for purposes of display at the one of the end user devices associated with the user.
  - 18. The server device of claim 12, wherein the one or more processors are further configured to output, for display at one or more of the end user devices, the combined transcription.
  - 19. The server device of claim 12, wherein the plurality of media sub-streams are generated during a real-time communication session, and wherein the combined transcription is representative of at least a portion of speech during the real-time communication session.
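
Claims 2 and 12 recite determining keywords in the transcribed text, selecting hyperlinks based on them, and inserting at least one hyperlink into the text. The following sketch assumes the keyword-to-URL mapping is already known and that the hyperlink is rendered as an HTML anchor. Both are illustrative choices, and `insert_hyperlinks` and the example URLs are hypothetical.

```python
import re
from typing import Dict


def insert_hyperlinks(text: str, links: Dict[str, str]) -> str:
    """Wrap the first occurrence of each known keyword in an HTML anchor."""
    lookup = {k.lower(): v for k, v in links.items()}
    if not lookup:
        return text
    # One alternation, longest keyword first, so the text is scanned once and
    # already-inserted markup is never matched again.
    alternatives = "|".join(
        re.escape(k) for k in sorted(lookup, key=len, reverse=True)
    )
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    seen = set()

    def link_first(match: "re.Match[str]") -> str:
        key = match.group(0).lower()
        if key in seen:
            return match.group(0)  # later occurrences are left as plain text
        seen.add(key)
        return f'<a href="{lookup[key]}">{match.group(0)}</a>'

    return pattern.sub(link_first, text)


links = {
    "budget": "https://example.com/budget",    # placeholder URL
    "roadmap": "https://example.com/roadmap",  # placeholder URL
}
annotated = insert_hyperlinks(
    "The budget affects the roadmap, and the budget is tight.", links
)
print(annotated)
```

Linking only the first occurrence keeps the transcript readable. A production version would also need to escape the transcript text for HTML, which this sketch leaves out.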

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Gauci, Jason John
Primary Examiner(s): Desir, Pierre-Louis
Assistant Examiner(s): Shin, Seong-Ah A.
Application Number: US13/599,908
Time in Patent Office: 1,475 Days
Field of Search: 704/2, 704/9, 704/235, 704/10, 704/243, 704/270, 704/270.1, 704/275, 704/3, 705/319, 707/102, 707/769, 381/315
US Class Current: 1/1
CPC Class Codes

G06F 40/134   Hyperlinking

G06F 40/169   Annotation, e.g. comment da...

G06F 40/40   Processing or translation o...

G06Q 30/0277   Online advertisement

G10L 15/08   Speech classification or se...

G10L 15/26   Speech to text systems G10L...

G10L 2015/088   Word spotting

G10L 25/78   Detection of presence or ab...

H04L 65/403   Arrangements for multi-part...

H04M 2203/2061   Language aspects

H04M 3/56   Arrangements for connecting...

H04M 7/0012   Details of application prog...

H04N 7/15   Conference systems
