Methods and apparatus to generate a speech recognition library
Abstract
Methods and apparatus to generate a speech recognition library for use by a speech recognition system are disclosed. An example method comprises identifying a plurality of video segments having closed caption data corresponding to a phrase, the plurality of video segments associated with respective ones of a plurality of audio data segments, computing a plurality of difference metrics between a baseline audio data segment associated with the phrase and respective ones of the plurality of audio data segments, selecting a set of the plurality of audio data segments based on the plurality of difference metrics, identifying a first one of the audio data segments in the set as a representative audio data segment, determining a first phonetic transcription of the representative audio data segment, and adding the first phonetic transcription to a speech recognition library when the first phonetic transcription differs from a second phonetic transcription associated with the phrase in the speech recognition library.
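As an illustration only (not part of the disclosure), the abstract's flow can be sketched in Python: score each candidate audio segment against the baseline with a difference metric, keep the segments within a threshold, take the closest one as the representative, and add its phonetic transcription to the library when it is new. The dictionary layout, the `features`/`phonetic` keys, and the Euclidean metric are all assumptions for the sketch; the patent does not fix a particular metric or data structure.

```python
import math

def euclidean(a, b):
    """One possible difference metric between two fixed-length audio feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def update_library(library, phrase, baseline, candidates, threshold=1.0):
    """Sketch of the abstract's flow (hypothetical data layout):
    score candidates against the baseline, keep those within the
    threshold, pick the closest as the representative segment, and
    record its phonetic transcription if not already in the library."""
    scores = [(euclidean(baseline["features"], c["features"]), c) for c in candidates]
    kept = [(d, c) for d, c in scores if d <= threshold]
    if not kept:
        return library  # no segment was close enough to the baseline
    _, representative = min(kept, key=lambda dc: dc[0])
    transcription = representative["phonetic"]  # stand-in for a real phonetic transcriber
    known = library.setdefault(phrase, {baseline["phonetic"]})
    known.add(transcription)  # set semantics: only new transcriptions grow the library
    return library
```

A usage example: with a baseline pronunciation of "bass" and two candidate segments, only the acoustically close candidate contributes a new transcription.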
18 Claims
1. A device, comprising:
a memory to store instructions; and
a processor coupled to the memory, wherein responsive to executing the instructions, the processor performs operations comprising:
receiving video media content, wherein the video media content comprises images, audio content, and closed captioning of text from the audio content;
detecting an occurrence of a textual phrase in the closed captioning of the video media content as a detected occurrence;
selecting an audio segment from the audio content of the video media content as a selected audio segment, wherein the selected audio segment corresponds to the detected occurrence of the textual phrase in the closed captioning;
selecting from a speech recognition library an audio pronunciation associated with the textual phrase, wherein the speech recognition library comprises a group of identified audio segments, wherein the group of identified audio segments comprises a baseline audio pronunciation and collected audio pronunciations of the textual phrase;
comparing the selected audio segment with the group of identified audio segments from the speech recognition library;
determining whether an audio pronunciation of the selected audio segment differs from the baseline audio pronunciation from the speech recognition library;
responsive to determining that the audio pronunciation of the selected audio segment differs from the baseline audio pronunciation, generating a phonetic transcription of the audio pronunciation of the selected audio segment; and
adding the phonetic transcription and the textual phrase to the group of identified audio segments in the speech recognition library to populate the collected audio pronunciations of the selected audio segment.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
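For illustration only, claim 1's segment-selection step (matching a detected caption occurrence to its stretch of audio) can be sketched by treating each closed-caption cue as a `(start, end, text)` triple with times in seconds; the cue format and the sample-index arithmetic are assumptions of the sketch, not claim language.

```python
def find_phrase_segments(captions, audio, sample_rate, phrase):
    """For each caption cue containing the phrase, return the slice of
    audio samples covering that cue's time span (claim 1's
    'selected audio segment', under a hypothetical cue format)."""
    segments = []
    for start, end, text in captions:
        if phrase.lower() in text.lower():
            lo = int(start * sample_rate)  # cue start time -> sample index
            hi = int(end * sample_rate)    # cue end time -> sample index
            segments.append(audio[lo:hi])
    return segments
```

For example, with a 4 Hz toy signal and a cue spanning 2.0–3.0 s, the phrase "nevada" maps to samples 8 through 11.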
8. An apparatus, comprising:
a memory to store instructions; and
a processor coupled to the memory, wherein responsive to executing the instructions, the processor performs operations comprising:
identifying an audio data segment of a video data segment associated with a closed caption textual phrase from the audio data segment;
selecting from a speech recognition library a baseline audio pronunciation associated with the textual phrase, wherein the speech recognition library comprises the baseline audio pronunciation and collected audio pronunciations of the textual phrase;
calculating a difference metric between the audio data segment and the baseline audio pronunciation associated with the textual phrase from the speech recognition library;
determining a first phonetic transcription of the audio data segment responsive to the difference metric indicating a difference between the audio data segment and the baseline audio pronunciation;
determining that the first phonetic transcription differs from a baseline phonetic transcription of the baseline audio pronunciation; and
responsive to the determining that the first phonetic transcription differs from the baseline phonetic transcription, populating the collected audio pronunciations for the textual phrase in the speech recognition library, wherein the collected audio pronunciations include the first phonetic transcription and the audio data segment of the textual phrase.
- View Dependent Claims (9, 10)
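Claim 8 recites "calculating a difference metric" without fixing one; a common choice for comparing two audio segments of different lengths is dynamic time warping (DTW) over per-frame features. The sketch below is one plausible metric under that assumption, operating on scalar per-frame feature sequences rather than real acoustic features.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two sequences of per-frame
    feature values; one plausible 'difference metric' for claim 8.
    Zero means the sequences align perfectly after time stretching."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local frame-to-frame difference
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Because DTW tolerates time stretching, `[1, 2, 3]` and `[1, 2, 2, 3]` score 0.0, while genuinely different sequences score positively; a pronunciation would be flagged as differing when the metric exceeds some threshold.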
11. A non-transitory machine-readable storage medium, comprising instructions, wherein responsive to executing the instructions, a processor performs operations comprising:
comparing audio segments matched to a textual phrase in closed captioning from a video content stream to a baseline audio pronunciation, wherein the baseline audio pronunciation is selected from a speech recognition library and is associated with the textual phrase in the speech recognition library, wherein the speech recognition library comprises the baseline audio pronunciation and collected audio pronunciations of the textual phrase;
identifying one of the audio segments having a pronunciation that differs from the baseline audio pronunciation of the textual phrase as an identified audio segment of the textual phrase;
generating a phonetic transcription of the pronunciation of the identified audio segment from closed captioning data of the video content stream; and
adding the phonetic transcription of the pronunciation of the identified audio segment to the collected audio pronunciations of the textual phrase in the speech recognition library to populate the speech recognition library.
- View Dependent Claims (12, 13)
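Claim 11's final step, populating the collected pronunciations, can be sketched with a hypothetical per-phrase library entry holding a baseline transcription and a list of collected ones; the entry layout and the ARPABET-style strings are assumptions for illustration.

```python
def add_pronunciation(library, phrase, baseline, observed):
    """Record an observed phonetic transcription under its phrase, but
    only when it differs from the baseline and has not already been
    collected (hypothetical library layout, per claim 11's last step)."""
    entry = library.setdefault(phrase, {"baseline": baseline, "collected": []})
    if observed != entry["baseline"] and observed not in entry["collected"]:
        entry["collected"].append(observed)
    return library
```

For instance, the two common pronunciations of "caramel" would yield one collected entry alongside the baseline, and re-observing either adds nothing new.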
14. A method, comprising:
causing a processor to perform a phonetic transcription of an audio segment corresponding to a textual phrase in closed captioning from a video media source responsive to detecting a difference in pronunciation between the audio segment and a baseline audio pronunciation associated with the textual phrase from a speech recognition library, wherein the speech recognition library comprises the baseline audio pronunciation and collected audio pronunciations of the textual phrase, and wherein the video media source comprises image data, audio data, and closed captioning data; and
responsive to detecting the difference in the pronunciation:
causing the processor to store the phonetic transcription of the audio segment in the speech recognition library as one of the collected audio pronunciations to populate the speech recognition library; and
adding the audio segment corresponding to the textual phrase to the speech recognition library to populate the speech recognition library.
- View Dependent Claims (15, 16, 17, 18)
Specification