Method and apparatus for presenting images representative of an utterance with corresponding decoded speech
Abstract
Apparatus for presenting images representative of one or more words in an utterance with corresponding decoded speech includes, in one aspect, a visual detector for capturing images of body movements (e.g., lip and/or mouth movements) corresponding to the one or more words in the utterance coupled to a visual feature extractor. The visual feature extractor receives time information from an automatic speech recognition (ASR) system and operatively processes the captured images from the visual detector to generate one or more image segments based on the time information relating to one or more decoded words in the utterance, each image segment corresponding to a decoded word in the utterance. An image player coupled to the visual feature extractor presents an image segment with a corresponding decoded word. The image segment may be presented as an animation of successive images in time, whereby a user is provided multiple sources of information for comprehending the utterance and can more easily ascertain the relationship between the body movements and the corresponding decoded speech.
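The segmentation described in the abstract can be illustrated with a minimal Python sketch. It assumes that each captured frame carries a timestamp in milliseconds and that the ASR system reports a start and end time per decoded word. The names `DecodedWord`, `ImageSegment`, and `segment_frames` are hypothetical, and the half-open interval convention (start inclusive, end exclusive) is an assumption not stated in the source.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class DecodedWord:
    """A word decoded by the ASR system, with its time information (hypothetical type)."""
    text: str
    start_ms: int
    end_ms: int


@dataclass(frozen=True)
class ImageSegment:
    """Indices of the successive captured frames that belong to one decoded word."""
    word: str
    frame_indices: Tuple[int, ...]


def segment_frames(frame_times_ms: Sequence[int],
                   words: Sequence[DecodedWord]) -> List[ImageSegment]:
    """Build one image segment per decoded word from timestamped frames.

    Assumption: a frame belongs to a word when start_ms <= t < end_ms.
    """
    segments = []
    for word in words:
        indices = tuple(
            i for i, t in enumerate(frame_times_ms)
            if word.start_ms <= t < word.end_ms
        )
        segments.append(ImageSegment(word.text, indices))
    return segments


# Ten frames captured at 100 ms intervals, two decoded words.
frame_times = list(range(0, 1000, 100))
decoded = [DecodedWord("hello", 0, 400), DecodedWord("world", 500, 900)]
segments = segment_frames(frame_times, decoded)
```

With these inputs the frame at 400 ms falls between the two words and is assigned to neither segment, which is one consequence of the interval convention assumed here.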
19 Claims
1. Apparatus for presenting images representative of one or more words in an utterance with corresponding decoded speech, the apparatus comprising:
a visual detector, the visual detector capturing images of body movements substantially concurrently from the one or more words in the utterance;
a visual feature extractor coupled to the visual detector, the visual feature extractor receiving time information from an automatic speech recognition (ASR) system and operatively processing the captured images into one or more image segments based on the time information relating to one or more words, decoded by the ASR system, in the utterance, each image segment comprising a plurality of successive images in time corresponding to a decoded word in the utterance; and
an image player operatively coupled to the visual feature extractor, the image player receiving and presenting decoded word with each image segment generated therefrom;
wherein the image player repeatedly presents one or more image segments with the corresponding decoded word by looping on a time sequence of successive images corresponding to the decoded word, wherein the image player displays each image segment in a separate window on a display in close proximity to the decoded speech text corresponding to the image segment.
Dependent claims: 2, 3, 4, 5, 6, 7
8. Apparatus for presenting images representative of one or more words in an utterance with corresponding decoded speech, the apparatus comprising:
an automatic speech recognition (ASR) engine for converting the utterance into one or more decoded words, the ASR engine generating time information associated with each of the decoded words;
a visual detector, the visual detector capturing images of body movements substantially concurrently from one or more words in the utterance;
a visual feature extractor coupled to the visual detector, the visual feature extractor receiving the time information from the ASR engine and operatively processing the captured images into one or more image segments based on the time information relating to the decoded words, each image segment comprising a plurality of successive images in time corresponding to a decoded word in the utterance; and
an image player operatively coupled to the visual feature extractor, the image player receiving and presenting the decoded word with each image segment generated therefrom;
wherein the image player repeatedly presents one or more image segments with the corresponding decoded word by looping on a time sequence of successive images corresponding to the decoded word, wherein the image player displays each image segment in a separate window on a display in close proximity to the decoded speech text corresponding to the image segment.
Dependent claims: 9
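The looping presentation recited for the image player can be sketched as a generator that restarts the frame sequence whenever it reaches the end. This is only an illustration under stated assumptions: frames are represented by placeholder strings, the function name `loop_segment` is hypothetical, and actual display in a separate window next to the decoded text is left out.

```python
from itertools import cycle, islice
from typing import Iterator, List, Sequence, Tuple


def loop_segment(word: str, frames: Sequence[str]) -> Iterator[Tuple[str, str]]:
    """Yield (decoded word, frame) pairs indefinitely, looping over the frames.

    Each yielded pair is what a player would show together: the decoded
    word and the current image of its segment.
    """
    if not frames:
        raise ValueError("an image segment needs at least one frame")
    for frame in cycle(frames):
        yield word, frame


# Take seven display steps from a three-frame segment; the sequence wraps around.
shown: List[Tuple[str, str]] = list(islice(loop_segment("hello", ["f0", "f1", "f2"]), 7))
```

A real player would pace the loop at the capture frame rate and stop it when the window is closed; both are omitted here.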
10. A method for presenting images representative of one or more words in an utterance with corresponding decoded speech, the method comprising the steps of:
capturing a plurality of images representing body movements substantially concurrently from the one or more words in the utterance;
associating each of the captured images generated from the one or more words in the utterance with time information relating to an occurrence of the image;
receiving, from an automatic speech recognition (ASR) system, data including a start time and an end time of a word decoded by the ASR system;
aligning the plurality of images into one or more image segments according to the start and stop times received from the ASR system, wherein each image segment corresponds to a decoded word in the utterance; and
presenting the decoded word with the corresponding image segment generated therefrom;
wherein the step of presenting the decoded word with the corresponding image segment generated therefrom comprises repeatedly looping on a time sequence of successive images corresponding to the decoded word, wherein the step of presenting displays each image segment in a separate window on a display in close proximity to the decoded speech text corresponding to the image segment.
Dependent claims: 11, 12, 13, 14
15. In an automatic speech recognition (ASR) system for converting an utterance of a speaker into one or more decoded words, a method for enhancing the ASR system comprising the steps of:
capturing a plurality of successive images in time representing body movements substantially concurrently from one or more words in the utterance;
associating each of the captured images generated from the one or more words in the utterance with time information relating to an occurrence of the image;
obtaining, from the ASR system, time ends for each decoded word in the utterance;
grouping the plurality of images into one or more image segments based on the time ends, wherein each image segment corresponds to a decoded word in the utterance; and
presenting the decoded word with the corresponding image segment generated therefrom;
wherein the step of presenting the decoded word with the corresponding image segment generated therefrom comprises repeatedly looping on a time sequence of successive images corresponding to the decoded word, wherein the step of presenting displays each image segment in a separate window on a display in close proximity to the decoded speech text corresponding to the image segment.
Dependent claims: 16, 17, 18
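One way to picture grouping images by time ends is a binary search over the word end times. The sketch below rests on assumptions the claim does not spell out: "time ends" is read as a single end time per word, words are taken to be contiguous so that each word begins where the previous one ends, and timestamps are integers in milliseconds. The function name `group_by_time_ends` is hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def group_by_time_ends(frame_times_ms: Sequence[int],
                       word_ends_ms: Sequence[int]) -> List[List[int]]:
    """Group frame indices into one list per word using only word end times.

    word_ends_ms must be sorted in ascending order. A frame at time t is
    assigned to the first word whose end time is strictly greater than t;
    frames at or after the last end time are left ungrouped.
    """
    groups: List[List[int]] = [[] for _ in word_ends_ms]
    for index, t in enumerate(frame_times_ms):
        k = bisect_right(word_ends_ms, t)  # index of first end time > t
        if k < len(word_ends_ms):
            groups[k].append(index)
    return groups


# Ten frames at 100 ms intervals; two words ending at 400 ms and 900 ms.
groups = group_by_time_ends(list(range(0, 1000, 100)), [400, 900])
```

Under these assumptions the frame at 400 ms goes to the second word and the frame at 900 ms is left ungrouped, since it does not precede any word end.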
19. A method for presenting images representative of one or more words in an utterance with corresponding decoded speech, the method comprising the steps of:
providing an automatic speech recognition (ASR) engine;
decoding, in the ASR engine, the utterance into one or more words, each of the decoded words having a start time and a stop time associated therewith;
capturing a plurality of images representing body movements substantially concurrently from the one or more words in the utterance;
buffering the plurality of images by a predetermined delay;
receiving, from the ASR engine, data including the start time and the end time of a decoded word;
aligning the plurality of images into one or more image segments according to the start and stop times received from the ASR engine, wherein each image segment corresponds to a decoded word in the utterance; and
presenting the decoded word with the corresponding image segment generated therefrom;
wherein the step of presenting the decoded word with the corresponding image segment generated therefrom comprises repeatedly looping on a time sequence of successive images corresponding to the decoded word, wherein the step of presenting displays each image segment in a separate window on a display in close proximity to the decoded speech text corresponding to the image segment.
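The buffering step can be pictured as a first-in, first-out queue that holds each frame for a fixed delay before releasing it for alignment. The following is a minimal sketch, not the patented implementation: the class name `DelayBuffer`, the millisecond timestamps, and the 300 ms delay are all assumptions chosen for illustration, and frames are represented by placeholder strings.

```python
from collections import deque
from typing import Deque, List, Tuple


class DelayBuffer:
    """Hold timestamped frames until a predetermined delay has elapsed."""

    def __init__(self, delay_ms: int) -> None:
        if delay_ms < 0:
            raise ValueError("delay must be non-negative")
        self.delay_ms = delay_ms
        self._queue: Deque[Tuple[int, str]] = deque()

    def push(self, timestamp_ms: int, frame: str) -> None:
        """Store a frame; frames are expected in capture order."""
        self._queue.append((timestamp_ms, frame))

    def release(self, now_ms: int) -> List[Tuple[int, str]]:
        """Return, oldest first, every frame captured at least delay_ms ago."""
        ready = []
        while self._queue and self._queue[0][0] + self.delay_ms <= now_ms:
            ready.append(self._queue.popleft())
        return ready


buffer = DelayBuffer(delay_ms=300)
for t, name in [(0, "f0"), (100, "f1"), (200, "f2")]:
    buffer.push(t, name)

early = buffer.release(now_ms=250)   # nothing is old enough yet
middle = buffer.release(now_ms=400)  # frames captured at 0 ms and 100 ms
late = buffer.release(now_ms=500)    # frame captured at 200 ms
```

Holding frames back in this way gives the recognizer time to produce word timings before the frames are aligned; how long the delay should be is not specified in the claim.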
Specification