Text transcript generation from a communication session
First Claim
1. A method for transcribing speech from a real-time communication session, the method comprising:
receiving, by one or more processors, a combined media stream comprising a plurality of media sub-streams each associated with one of a plurality of end user devices, wherein each of the plurality of media sub-streams comprises a respective video component and a respective audio component, wherein each of the plurality of media sub-streams in the combined media stream is separable from others of the plurality of media sub-streams in the combined media stream;
separating, by the one or more processors, each of the media sub-streams in the combined media stream from the combined media stream;
for each of the separated media sub-streams, separating, by the one or more processors, the respective audio component from the respective video component;
for each separated audio component of the respective media sub-streams associated with one of the plurality of end user devices:
identifying one or more periods of non-speech based on an amplitude of an audio signal, each period of non-speech indicative of a break between phrases of speech;
generating a plurality of portions of audio based on the identified periods of non-speech; and
separately transcribing, by the one or more processors, speech from the plurality of portions of audio to text for the respective media sub-stream;
combining, by the one or more processors, the separately transcribed text for each of the respective media sub-streams into a combined transcription; and
annotating the text to include additional content by determining one or more keywords of the text and selecting, based on the one or more keywords, one or more advertisements.
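The amplitude-based segmentation step recited above can be illustrated with a minimal sketch. The claim does not specify a threshold, a minimum pause length, or a sample format, so `threshold`, `min_len`, and the list-of-floats representation below are assumptions made purely for illustration, as are the function names.

```python
from typing import List, Sequence, Tuple


def find_non_speech(samples: Sequence[float], threshold: float,
                    min_len: int) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges where |amplitude| stays below
    `threshold` for at least `min_len` consecutive samples.
    Both parameters are illustrative assumptions, not values from the claim."""
    periods = []
    start = None
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if start is None:
                start = i  # a quiet run begins here
        else:
            if start is not None and i - start >= min_len:
                periods.append((start, i))
            start = None
    # Handle a quiet run that extends to the end of the audio.
    if start is not None and len(samples) - start >= min_len:
        periods.append((start, len(samples)))
    return periods


def split_on_non_speech(samples: Sequence[float], threshold: float = 0.05,
                        min_len: int = 3) -> List[List[float]]:
    """Cut the audio into portions separated by the detected quiet periods."""
    portions = []
    cursor = 0
    for start, end in find_non_speech(samples, threshold, min_len):
        if start > cursor:
            portions.append(list(samples[cursor:start]))
        cursor = end
    if cursor < len(samples):
        portions.append(list(samples[cursor:]))
    return portions


audio = [0.5, 0.6, 0.0, 0.0, 0.0, 0.7, 0.4, 0.01, 0.8]
portions = split_on_non_speech(audio)
```

In this sketch the single quiet sample near the end is too short to count as a break, so it stays inside the second portion. Each portion would then be passed to a speech-to-text step separately.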
Abstract
Techniques, systems, and devices for managing streaming media among end user devices in a video conferencing system are described. For example, a transcript may be automatically generated for a video conference. In one example, a method may include receiving a combined media stream comprising a plurality of media sub-streams each associated with one of a plurality of end user devices, wherein each of the plurality of media sub-streams comprises a respective video component and a respective audio component. The method may also include: for each of the media sub-streams, separating the audio component from the respective video component; for each audio component of the respective media sub-streams, transcribing speech from the audio component to text for the respective media sub-stream; and combining the text for each of the respective media sub-streams into a combined transcription. In some examples, the combined transcription may also be translated into a user-selected language.
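The separation steps in the abstract (sub-streams out of the combined stream, then audio apart from video) might look roughly like the sketch below. The source does not describe a wire format, so the `Packet` type, its `device_id` and `kind` fields, and the `demux` function are hypothetical names chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Packet(NamedTuple):
    """Hypothetical unit of the combined media stream (an assumption)."""
    device_id: str  # which end user device produced the packet
    kind: str       # "audio" or "video"
    payload: bytes


def demux(combined: Iterable[Packet]) -> Dict[str, Dict[str, List[bytes]]]:
    """Group packets by device, then split each device's sub-stream
    into its audio component and its video component."""
    streams = defaultdict(lambda: {"audio": [], "video": []})
    for packet in combined:
        streams[packet.device_id][packet.kind].append(packet.payload)
    return dict(streams)


combined = [
    Packet("device-a", "audio", b"a1"),
    Packet("device-b", "video", b"v1"),
    Packet("device-a", "video", b"v2"),
    Packet("device-b", "audio", b"a2"),
]
separated = demux(combined)
```

After this step, `separated["device-a"]["audio"]` holds only the audio payloads for that device, which is the input a per-device transcription step would need.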
150 Citations
19 Claims
2. A method for transcribing speech in a communication session comprising:
receiving, by one or more processors, a combined media stream comprising a plurality of media sub-streams each associated with one of a plurality of end user devices, wherein each of the plurality of media sub-streams in the combined media stream comprises a respective video component and a respective audio component, wherein each of the plurality of media sub-streams in the combined media stream is separable from others of the plurality of media sub-streams in the combined media stream;
separating each of the media sub-streams in the combined media stream from the combined media stream;
for each of the separated media sub-streams, separating, by the one or more processors, the respective audio component from the respective video component;
for each separate audio component of each of the respective media sub-streams associated with one of the plurality of end user devices, separately transcribing, by the one or more processors, speech from the audio component to text for the respective media sub-stream;
combining, by the one or more processors, the separately transcribed text for each of the respective media sub-streams into a combined transcription that identifies text associated with each end user device based on the respective sub-stream; and
annotating the text for the audio component of each respective media sub-stream to include additional content, wherein annotating the text comprises:
determining one or more keywords of the text;
selecting, based on the one or more keywords, one or more hyperlinks; and
inserting at least one of the one or more hyperlinks into the text.
Dependent claims: 3, 4, 5, 6, 7, 8, 9, 10, 11
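The annotation step of claim 2 (determine keywords, select hyperlinks, insert them into the text) can be sketched as follows. The claim does not say how keywords are determined or how links are chosen; this sketch assumes a pre-built keyword-to-URL table and links only the first occurrence of each keyword. The function name, the table, and the `<url>` insertion format are all illustrative assumptions.

```python
import re
from typing import Dict


def annotate_with_links(text: str, links: Dict[str, str]) -> str:
    """Insert ' <url>' after the first occurrence of each known keyword.
    `links` maps keyword -> URL and is a hypothetical lookup table."""
    lowered = {keyword.lower(): url for keyword, url in links.items()}
    if not lowered:
        return text
    # One alternation pattern, so inserted URLs are never re-scanned.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in lowered) + r")\b",
        re.IGNORECASE,
    )
    seen = set()

    def add_link(match: "re.Match[str]") -> str:
        key = match.group(0).lower()
        if key in seen:
            return match.group(0)  # later occurrences stay unlinked
        seen.add(key)
        return f"{match.group(0)} <{lowered[key]}>"

    return pattern.sub(add_link, text)


transcript = "We should book the projector before the projector demo."
annotated = annotate_with_links(
    transcript, {"projector": "https://example.com/projector"}
)
```

A single-pass substitution is used so that a keyword appearing inside an already inserted URL is not matched again.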
12. A server device comprising:
one or more processors configured to:
receive by a communication server a combined media stream comprising a plurality of media sub-streams each associated with one of a plurality of end user devices, wherein each of the plurality of media sub-streams in the combined media stream comprises a respective video component and a respective audio component, wherein each of the plurality of media sub-streams in the combined media stream is separable from others of the plurality of media sub-streams in the combined media stream;
separate, by the communication server, each of the media sub-streams in the combined media stream from the combined media stream;
for each of the separated media sub-streams, separate by the communication server the respective audio component from the respective video component;
for each audio component of each of the respective media sub-streams, separately transcribe by a transcription module, using one or more speech-to-text units, speech from the audio component to text for the respective media sub-stream;
combine by an annotation module the text for each of the respective media sub-streams into a combined transcription that identifies text associated with each end user device based on the respective sub-stream; and
annotate the text for the audio component of each respective media sub-stream to include additional content, wherein annotating the text comprises:
determining one or more keywords of the text;
selecting, based on the one or more keywords, one or more hyperlinks; and
inserting at least one of the one or more hyperlinks into the text.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
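The combining step of claim 12, which produces a combined transcription that identifies the text associated with each end user device, could be sketched as below. The claim does not state how the text is ordered; sorting by a per-utterance start time is an assumption, as are the `Utterance` type, the `combine_transcripts` name, and the `[device]` label format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    start: float  # seconds from session start (assumed to be available)
    text: str


def combine_transcripts(per_device: Dict[str, List[Utterance]]) -> str:
    """Merge separately transcribed text into one transcript, labelling
    each line with the device whose sub-stream it came from."""
    tagged = [
        (utterance.start, device, utterance.text)
        for device, utterances in per_device.items()
        for utterance in utterances
    ]
    tagged.sort(key=lambda item: item[0])  # chronological order (assumption)
    return "\n".join(f"[{device}] {text}" for _, device, text in tagged)


combined_text = combine_transcripts({
    "device-a": [Utterance(0.0, "Hello"), Utterance(5.0, "Fine, thanks")],
    "device-b": [Utterance(2.0, "Hi, how are you?")],
})
```

Because each sub-stream is transcribed on its own, the device label comes directly from the sub-stream and no speaker-identification step is needed in this sketch.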
Specification