Methods and systems for performing synchronization of audio with corresponding textual transcriptions and determining confidence values of the synchronization
Abstract
Methods and systems are provided for synchronizing audio with a corresponding textual transcription and for determining confidence values of the timing synchronization. Audio and a corresponding text (e.g., a transcript) may be synchronized in both a forward and a reverse direction using speech recognition to output time-annotated, synchronized audio-lyrics data. Metrics can be computed to quantify and/or qualify a confidence of the synchronization. Based on the metrics, example embodiments describe methods for enhancing an automated synchronization process to possibly adapt Hidden Markov Models (HMMs) to the synchronized audio for use during the speech recognition. Other examples describe methods for selecting an appropriate HMM for use.
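As a minimal sketch of how such a confidence metric might be computed, the Python below compares the word boundaries from the forward pass with those recovered from the reverse pass and reports the fraction of words whose start and end times agree within a tolerance. The tuple layout, the function name `boundary_confidence`, and the 0.5-second default tolerance are assumptions made for illustration; the source does not specify a particular metric.

```python
from typing import List, Tuple

# Hypothetical layout: (word, start_seconds, end_seconds)
Boundary = Tuple[str, float, float]


def boundary_confidence(
    forward: List[Boundary],
    second_forward: List[Boundary],
    tolerance: float = 0.5,  # assumed agreement threshold in seconds
) -> float:
    """Return the fraction of words whose boundaries agree in both passes."""
    if len(forward) != len(second_forward):
        raise ValueError("both alignments must cover the same words")
    if not forward:
        return 0.0

    agreeing = 0
    for (f_word, f_start, f_end), (s_word, s_start, s_end) in zip(
        forward, second_forward
    ):
        if f_word != s_word:
            raise ValueError("alignments list different words")
        # A word counts as agreeing when both of its boundaries
        # differ by no more than the tolerance.
        if abs(f_start - s_start) <= tolerance and abs(f_end - s_end) <= tolerance:
            agreeing += 1
    return agreeing / len(forward)
```

A value near 1.0 would suggest that the two passes place most words at the same times, while a low value would flag a synchronization that may need review.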
27 Claims
1. A method of processing audio signals, comprising:
receiving an audio signal comprising vocal elements;
a processor performing a forward alignment of the vocal elements in a forward direction with corresponding textual transcriptions of the vocal elements;
based on the forward alignment, determining forward timing boundary information associated with an elapsed amount of time for a duration of a portion of the vocal elements processed in the forward direction;
the processor performing a reverse alignment of the vocal elements processed in a reverse direction with corresponding reverse textual transcriptions of the vocal elements;
determining reverse timing boundary information associated with an elapsed amount of time for a duration of the portion of the vocal elements processed in the reverse direction;
processing the reverse alignment of the vocal elements and the reverse timing boundary information so as to provide the reverse alignment and the reverse timing boundary information in a forward direction and establish a second forward alignment of the vocal elements and a second forward timing boundary information; and
based on a comparison between the forward timing boundary information and the second forward timing boundary information, outputting a confidence metric indicating a level of certainty for at least one of the forward timing boundary information and the second forward timing boundary information.
Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
18. A non-transitory computer readable storage medium having stored therein instructions executable by a computing device to cause the computing device to perform functions of:
receiving an audio signal comprising vocal elements;
performing a forward alignment of the vocal elements in a forward direction with corresponding textual transcriptions of the vocal elements;
based on the forward alignment, determining forward timing boundary information associated with an elapsed amount of time for a duration of a portion of the vocal elements processed in the forward direction;
the processor performing a reverse alignment of the vocal elements processed in a reverse direction with corresponding reverse textual transcriptions of the vocal elements;
determining reverse timing boundary information associated with an elapsed amount of time for a duration of the portion of the vocal elements processed in the reverse direction;
processing the reverse alignment of the vocal elements and the reverse timing boundary information so as to provide the reverse alignment and the reverse timing boundary information in a forward direction and establish a second forward alignment of the vocal elements and a second forward timing boundary information; and
based on a comparison between the forward timing boundary information and the second forward timing boundary information, outputting a confidence metric indicating a level of certainty for at least one of the forward timing boundary information and the second forward timing boundary information.
Dependent Claims (19, 20, 21, 22, 23)
24. A system comprising:
a Hidden Markov Model (HMM) database that includes phonetic modeling of words;
a pronunciation dictionary database that includes grammars representing words; and
a speech decoder configured to receive an audio signal and access the HMM to map vocal elements in the audio signal to phonetic descriptions and access the pronunciation dictionary database to map the phonetic descriptions to grammars,
the speech decoder further configured to perform a forward alignment of the grammars with corresponding textual transcriptions of the vocal elements in a forward direction and a reverse alignment of the vocal elements processed in a reverse direction with corresponding reverse textual transcriptions of the vocal elements,
wherein the speech decoder is configured to determine forward timing boundary information associated with an elapsed amount of time for a duration of a portion of the vocal elements processed in the forward direction and reverse timing boundary information associated with an elapsed amount of time for a duration of the portion of the vocal elements processed in the reverse direction, and
the speech decoder is configured to process the reverse alignment of the vocal elements and the reverse timing boundary information so as to provide the reverse alignment and the reverse timing boundary information in a forward direction and establish a second forward alignment of the vocal elements and a second forward timing boundary information, and
the speech decoder is configured to determine based on a comparison between the forward timing boundary information and the second forward timing boundary information a confidence metric indicating a level of certainty for at least one of the forward timing boundary information and the second forward timing boundary information for the duration of the portion of the vocal elements.
Dependent Claims (25, 26, 27)
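The step recited in the claims of providing the reverse alignment and its timing boundaries in a forward direction can be illustrated with a short Python sketch. It assumes the reverse pass ran on time-reversed audio of known duration and reported word-level boundaries measured from the start of that reversed audio, with each entry labeled by the original (unreversed) word. The function name `reverse_to_forward` and the data layout are hypothetical and not taken from the claims.

```python
from typing import List, Tuple

# Hypothetical layout: (word, start_seconds, end_seconds)
Boundary = Tuple[str, float, float]


def reverse_to_forward(
    reverse_alignment: List[Boundary], audio_duration: float
) -> List[Boundary]:
    """Map boundaries measured on reversed audio back onto the forward timeline."""
    second_forward = []
    # The reverse pass sees the last word first, so walk it backwards
    # to restore the original word order.
    for word, r_start, r_end in reversed(reverse_alignment):
        # A point t seconds into the reversed audio lies at
        # (audio_duration - t) seconds in the forward audio, so the
        # reversed end becomes the forward start and vice versa.
        second_forward.append(
            (word, audio_duration - r_end, audio_duration - r_start)
        )
    return second_forward
```

The resulting list plays the role of the second forward alignment and second forward timing boundary information, which can then be compared word by word against the boundaries from the forward pass.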
Specification