Method for aligning text with audio signals

US 6,076,059 A
Filed: 08/29/1997
Issued: 06/13/2000
Est. Priority Date: 08/29/1997
Status: Expired due to Fees



Abstract

In a computerized method, text segments of a text file are aligned with audio segments of an audio file. The text file includes written words, and the audio file includes spoken words. A vocabulary and language model are generated from the text segment. A word list is recognized from the audio segment using the vocabulary and language model. The word list is aligned with the text segment, and corresponding anchors are chosen in the word list and text segment. Using the anchors, the text segment and the audio segment are partitioned into unaligned and aligned segments according to the anchors. These steps are repeated for any unaligned segments until a termination condition is reached.
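The model-generation step can be illustrated with a minimal sketch. It counts one-, two-, and three-word sequences in the unaligned text segments and turns the counts into relative frequencies, as claim 1 describes. The tokenization rule, the normalization (count divided by the total number of sequences of the same length), and the names `tokenize` and `build_vocab_and_model` are assumptions made for illustration; the source does not specify them.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens; the source does not define tokenization.
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocab_and_model(segments, max_n=3):
    """Return (vocabulary, model) for a list of unaligned text segments.

    model[n] maps each n-word tuple to its relative frequency among all
    n-word sequences found in the segments (hypothetical normalization).
    """
    vocab = set()
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for segment in segments:
        words = tokenize(segment)
        vocab.update(words)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    model = {}
    for n, counter in counts.items():
        total = sum(counter.values())
        model[n] = {gram: c / total for gram, c in counter.items()} if total else {}
    return vocab, model


vocab, model = build_vocab_and_model(["the cat sat on the mat", "the cat ran"])
```

Because the model is built only from the segments passed in, calling the function again on the remaining unaligned segments yields a smaller vocabulary on each iteration.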

22 Claims

1. A computerized method for aligning text segments of a text file with audio segments of an audio file, comprising the steps of:
- generating a vocabulary and language model from the text file, generation of said model involving determination of relative probabilities of all one, two, and three word sequences in all unaligned text segments of the text file based upon frequencies of occurrences of said sequences in said unaligned text segments, all of said text segments being initially classified as unaligned text segments;
  
  recognizing a word list from the audio segments using the vocabulary and language model but without considering the text file;
  
  aligning the word list with the text segments based upon respective scores for all possible alignments of words in the word list with the text segments, each respective score being weighted to increase each respective score by a relatively greater amount if a respective alignment associated with the respective score involves relatively longer sequences of correctly aligned words;
  
  choosing corresponding anchors in the word list and text segments in accordance with the respective scores;
  
  partitioning the text and the audio segments into unaligned and aligned text and audio segments according to the anchors; and
  
  repeating the generating, recognizing, aligning, choosing, and partitioning steps with the unaligned text and audio segments until a termination condition is reached.
- Dependent claims (2 to 21):
  - 2. The method of claim 1 wherein the text segments correspond to the entire text file, and the audio segments correspond to the entire audio file.
  - 3. The method of claim 1 wherein the language model is in the form of trigrams.
  - 4. The method of claim 1 wherein the word list includes a sequential list of recognized spoken words of the audio segments.
  - 5. The method of claim 1 further including the step of annotating the word list with timing information.
  - 6. The method of claim 4 wherein the timing information includes the starting time and duration of each recognized spoken word.
  - 7. The method of claim 1 wherein the word list is recognized using acoustic-phonetic models of a speech recognizer.
  - 8. The method of claim 1 further including the steps of determining a plurality of possible alignments, scoring each possible alignment, and selecting a best alignment using dynamic programming.
  - 9. The method of claim 8 further including the step of increasing the score of a particular possible alignment when the particular possible alignment includes a relatively long sequence of correctly aligned words.
  - 10. The method of claim 5 further including the step of annotating the text segments with the timing information for correctly aligned words of the audio segments.
  - 11. The method of claim 1 wherein the termination condition occurs when all of the text and audio segments are fully aligned.
  - 12. The method of claim 1 wherein the termination condition is a detection of all anchors in the text and audio segments.
  - 13. The method of claim 1 wherein the termination condition is reached for a particular unaligned segment if the particular unaligned segment has a duration less than a predetermined threshold.
  - 14. The method of claim 1 wherein the unaligned segments include portions of adjacent aligned segments.
  - 15. The method of claim 1 wherein the vocabulary and language model is rebuilt from the unaligned segments during repeated iterations of the generating step.
  - 16. The method of claim 1 wherein the text segments correspond to a portion of the audio segments.
  - 17. The method of claim 7 wherein the acoustic-phonetic models are dynamically adjusted during repeated iterations according to the aligned segments.
  - 18. The method of claim 7 wherein the acoustic-phonetic models are dynamically adjusted during repeated iterations to take into consideration channel characteristics.
  - 19. The method of claim 17 wherein the acoustic-phonetic models initially are speaker independent, and the acoustic-phonetic models are adjusted to be speaker dependent during the iterations.
  - 20. The method of claim 19 wherein the text segments are annotated with speaker identification information for a plurality of speakers, and further including the step of adjusting the acoustic-phonetic models to include a plurality of speaker dependent models.
  - 21. The method of claim 20 wherein corresponding text segments are generated from particular audio segments using the plurality of speaker dependent models.
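The aligning, choosing, and partitioning steps of claim 1, together with the dynamic programming of claims 8 and 9, might be sketched as follows. This is one possible scoring scheme, not the patent's: a run of k consecutive matching words scores k*k, so longer runs raise the score by a relatively greater amount. The function names, the quadratic weight, the `min_len` anchor threshold, and the use of word indices in place of audio time are all assumptions.

```python
def align_runs(hyp, ref):
    """Align recognized words (hyp) with text words (ref) by dynamic programming.

    Returns matched runs as (hyp_start, ref_start, length) tuples, in order.
    """
    n, m = len(hyp), len(ref)
    run = [[0] * (m + 1) for _ in range(n + 1)]    # matching run ending at (i, j)
    score = [[0] * (m + 1) for _ in range(n + 1)]  # best score for hyp[:i], ref[:j]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if hyp[i - 1] == ref[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
            # Option 1: leave hyp[i-1] or ref[j-1] unaligned (score unchanged).
            best, move = score[i - 1][j], (i - 1, j, 0)
            if score[i][j - 1] > best:
                best, move = score[i][j - 1], (i, j - 1, 0)
            # Option 2: end the alignment with a run of k matching words.
            for k in range(1, run[i][j] + 1):
                cand = score[i - k][j - k] + k * k
                if cand > best:
                    best, move = cand, (i - k, j - k, k)
            score[i][j], back[i][j] = best, move
    runs, i, j = [], n, m
    while i > 0 and j > 0:                         # trace the best path back
        pi, pj, k = back[i][j]
        if k:
            runs.append((pi, pj, k))
        i, j = pi, pj
    return runs[::-1]


def choose_anchors(runs, min_len=3):
    # Assumption: only runs of at least min_len words are trusted as anchors.
    return [r for r in runs if r[2] >= min_len]


def partition(anchors, n_hyp, n_ref):
    """Return unaligned gaps as ((hyp_start, hyp_end), (ref_start, ref_end))."""
    gaps, h, r = [], 0, 0
    for hs, rs, k in anchors:
        if hs > h or rs > r:
            gaps.append(((h, hs), (r, rs)))
        h, r = hs + k, rs + k
    if h < n_hyp or r < n_ref:
        gaps.append(((h, n_hyp), (r, n_ref)))
    return gaps


hyp = "a quick brown fox jumped over the lazy dog".split()
ref = "the quick brown fox jumps over the lazy dog".split()
anchors = choose_anchors(align_runs(hyp, ref))     # [(1, 1, 3), (5, 5, 4)]
gaps = partition(anchors, len(hyp), len(ref))      # [((0, 1), (0, 1)), ((4, 5), (4, 5))]
```

In a full system each recognized word would also carry a start time and duration (claims 5 and 6), so the gaps would be expressed as audio time ranges instead of word indices.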

22. An apparatus for aligning text segments of a text file with audio segments of an audio file, comprising:
- an analyzer for analyzing the text segments to generate a vocabulary and language model, generation of said model involving determination of relative probabilities of all one, two, and three word sequences in all unaligned text segments of the text file based upon frequencies of occurrences of said sequences in said unaligned text segments, all of the text segments of the text file being initially classified as unaligned text segments;
  
  a speech recognizer for generating a word list from the audio segments using the vocabulary and language model but without considering the text file;
  
  an aligner for aligning the word list with the text segments based upon respective scores for all possible alignments of words in the word list with the text segments, each respective score being weighted to increase each respective score by a relatively greater amount if a respective alignment associated with the respective score involves relatively longer sequences of correctly aligned words;
  
  an anchor choosing mechanism for choosing corresponding anchors in the word list and text segments in accordance with the respective scores;
  
  a partitioning mechanism for partitioning the text and the audio segments into unaligned and aligned segments according to the anchors; and
  
  a repetition mechanism for repeating the generating, recognizing, aligning, choosing, and partitioning steps with the unaligned segments until a termination condition is reached.
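How the components of claim 22 could fit together is sketched below as a toy driver loop. Several simplifications are assumptions, not part of the source: the "audio" is a list of spoken words with one recognized word per item, so list positions stand in for timing; the analyzer builds only a vocabulary; Python's `difflib.SequenceMatcher` is swapped in for the scored aligner; and `toy_recognize` is a hypothetical recognizer that mishears "write" as "right" whenever "right" is in the active vocabulary.

```python
from difflib import SequenceMatcher


def toy_recognize(audio_words, vocab):
    # Hypothetical stand-in for a speech recognizer restricted to vocab.
    out = []
    for w in audio_words:
        if w == "write" and "right" in vocab:
            out.append("right")                    # simulated recognition error
        else:
            out.append(w if w in vocab else "<unk>")
    return out


def align_iteratively(text, audio, recognize, min_anchor=2, max_passes=5):
    """Return {text index: audio index} for every aligned word."""
    aligned = {}
    pending = [(0, len(text), 0, len(audio))]      # unaligned (text, audio) ranges
    for _ in range(max_passes):                    # repetition mechanism
        if not pending:                            # termination: nothing unaligned
            break
        still_unaligned = []
        for t0, t1, a0, a1 in pending:
            vocab = set(text[t0:t1])               # analyzer (vocabulary only)
            words = recognize(audio[a0:a1], vocab) # speech recognizer
            matcher = SequenceMatcher(None, words, text[t0:t1], autojunk=False)
            need = min(min_anchor, t1 - t0)        # short segments need shorter anchors
            anchors = [b for b in matcher.get_matching_blocks() if b.size >= need]
            if not anchors:
                continue                           # no anchors: stop on this segment
            h, t = 0, 0                            # cursors inside the segment
            for b in anchors:                      # partitioning mechanism
                if b.a > h and b.b > t:
                    still_unaligned.append((t0 + t, t0 + b.b, a0 + h, a0 + b.a))
                for k in range(b.size):
                    aligned[t0 + b.b + k] = a0 + b.a + k
                h, t = b.a + b.size, b.b + b.size
            if h < len(words) and t < t1 - t0:
                still_unaligned.append((t0 + t, t1, a0 + h, a1))
        pending = still_unaligned
    return aligned


text = "turn right now and then write it down please".split()
result = align_iteratively(text, list(text), toy_recognize)
```

In this toy run the first pass anchors the words on either side of the misheard word, and the second pass recovers it because the vocabulary rebuilt from the remaining one-word segment no longer contains "right".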


Current Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Original Assignee: Digital Equipment Corporation (HP Inc.)
Inventors: Glickman, Oren; Joerg, Christopher Frank
Primary Examiner(s): Zele, Krista
Assistant Examiner(s): Opsasnick, Michael N.
Application Number: US08/921,347
Time in Patent Office: 1,019 Days
Field of Search: 704/278, 704/251, 704/260, 704/231
US Class Current: 704/260
CPC Class Codes: G10L 15/26 Speech to text systems G10L...
