Method and apparatus for aligning texts

US 8,527,272 B2
Filed: 08/27/2010
Issued: 09/03/2013
Est. Priority Date: 08/28/2009
Status: Active Grant

Abstract

A method and apparatus for aligning texts. The method includes acquiring a target text and a reference text and aligning the target text and the reference text at word level based on phoneme similarity. The method can be applied to automatically archiving a multimedia resource and a method of automatically searching a multimedia resource.

24 Claims

1. A method for aligning texts, said method comprising:
- acquiring a target text and a reference text, said target text comprising recognized text acquired by performing speech recognition on speech data, and said reference text being associated with said speech data; and
  
  aligning said target text and said reference text at a word level based on phoneme similarity between phonemes in words in said target text and corresponding phonemes in corresponding words in said reference text, said aligning comprising;
  
  parsing said phonemes of said words in said target text and said corresponding phonemes of said corresponding words in said reference text; and
  
  computing a path penalty value by using a dynamic time warping DTW algorithm with said phoneme similarity, and finding a best path matching said target text and said reference text with said path penalty value, said aligning of said target text and said reference text being based on said best path, and said acquiring and said aligning being performed by a programmed data processing system.
- Dependent claims (2 to 11):
  - 2. The method according to claim 1, said speech data comprising any of audio data and video data acquired during an event, and said reference text being presented during said event and comprising any of closed-captioning text and presentation materials text.
  - 3. The method according to claim 2, further comprising establishing a temporal link between said speech data and said reference text based on alignment of said target text and said reference text.
  - 4. The method according to claim 1, said path penalty value comprising a sum of respective penalty values for each step of a path and each penalty value for each step of said path being computed as follows:
    - said penalty value is 0 for same words;
      
      said penalty value of a substitution error corresponds to a pronunciation similarity of two words, said pronunciation similarity being based on corresponding phoneme similarity; and
      
      said penalty value of any one of an insertion error and a deletion error is constant.
  - 5. The method according to claim 1, said phoneme similarity being predetermined.
  - 6. The method according to claim 1, said phoneme similarity being measured by acoustic model distance of phonemes.
  - 7. The method according to claim 6, said acoustic model distance comprises one of Euclidean distance, Mahalanobis Distance, and Bhattacharyya distance.
  - 8. The method according to claim 1, further comprising, before said aligning of said target text and said reference text at said word level, aligning said target text and said reference text at a paragraph level based on perplexity.
  - 9. The method according to claim 8, said aligning of said target text and said reference text at said paragraph level comprising:
    - establishing a language model for each paragraph in said reference text;
      
      computing perplexity scores for possible mappings for each sentence to each paragraph in said target text based on said language model; and
      
      selecting a mapping result with a low perplexity score to map each sentence in said target text to a paragraph in said target text.
  - 10. The method according to claim 9, said aligning of said target text and said reference text at said paragraph level based on said perplexity further comprising smoothing said mapping result.
  - 11. The method according to claim 1, further comprising, before said aligning of said target text and said reference text at said word level, performing successive word string matching between said target text and reference text to determine anchors in order to segment said target text and said reference text into smaller segments.
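
The word-level alignment described in claims 1 and 4 can be sketched as a dynamic-programming pass over the two word sequences. This is a minimal illustration under stated assumptions, not the patented implementation: the pronunciation dictionary `LEXICON`, the similarity table `PHONEME_SIM`, the constant `GAP_PENALTY`, and the position-by-position way word similarity is derived from phoneme similarity are all hypothetical choices made for this example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pronunciation dictionary (word -> phonemes); illustrative only.
LEXICON: Dict[str, List[str]] = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "cap": ["K", "AE", "P"],
    "sat": ["S", "AE", "T"],
    "down": ["D", "AW", "N"],
}

# Hypothetical predetermined phoneme similarity in [0, 1]; unlisted pairs count as 0.
PHONEME_SIM: Dict[Tuple[str, str], float] = {("P", "T"): 0.6}

GAP_PENALTY = 1.0  # assumed constant penalty for an insertion or a deletion


def phoneme_similarity(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return PHONEME_SIM.get((a, b), PHONEME_SIM.get((b, a), 0.0))


def substitution_penalty(w1: str, w2: str) -> float:
    """Return 0 for identical words, else 1 minus a crude pronunciation similarity."""
    if w1 == w2:
        return 0.0
    p1, p2 = LEXICON.get(w1, []), LEXICON.get(w2, [])
    if not p1 or not p2:
        return 1.0  # unknown pronunciation: treat as maximally dissimilar
    # Assumption: compare phonemes position by position, normalise by the longer word.
    total = sum(phoneme_similarity(a, b) for a, b in zip(p1, p2))
    return 1.0 - total / max(len(p1), len(p2))


def align(target: List[str], reference: List[str]
          ) -> Tuple[float, List[Tuple[Optional[str], Optional[str]]]]:
    """Return the path penalty and the best path as (target word, reference word) pairs."""
    n, m = len(target), len(reference)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + GAP_PENALTY
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + GAP_PENALTY
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution_penalty(target[i - 1], reference[j - 1]),
                cost[i - 1][j] + GAP_PENALTY,  # target word has no reference partner
                cost[i][j - 1] + GAP_PENALTY,  # reference word has no target partner
            )
    # Backtrack from the end of both texts to recover the best path.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == (
            cost[i - 1][j - 1] + substitution_penalty(target[i - 1], reference[j - 1])
        ):
            pairs.append((target[i - 1], reference[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + GAP_PENALTY:
            pairs.append((target[i - 1], None))
            i -= 1
        else:
            pairs.append((None, reference[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs
```

Calling `align("the cap sat".split(), "the cat sat down".split())` pairs the misrecognized "cap" with "cat" at a small penalty, because the two words differ only in one similar phoneme, and leaves "down" unmatched at the constant gap penalty.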

12. An apparatus for aligning texts, said apparatus comprising:
- an input acquiring a target text and a reference text, said target text comprising recognized text acquired by performing speech recognition on speech data, and said reference text being associated with said speech data; and
  
  a data processing machine operatively connected to said input and comprising a word alignment module aligning said target text and said reference text at a word level based on phoneme similarity between phonemes in words in said target text and corresponding phonemes in corresponding words in said reference text, said word alignment module comprising the following for performing said aligning of said target text and said reference text at said word level based on said phoneme similarity;
  
  a parsing module parsing said phonemes in said words in said target text and said corresponding phonemes in said corresponding words in said reference text;
  
  a dynamic time warping DTW module computing a path penalty value by using a dynamic time warping DTW algorithm with said phoneme similarity, and finding a best path matching said target text and said reference text with said path penalty value; and
  
  an alignment sub-module aligning said target text and said reference text based on said best path.
- Dependent claims (13 to 22):
  - 13. The apparatus according to claim 12, said speech data comprising any of audio data and video data acquired during an event, and said reference text being presented during said event and comprising any of closed-captioning text and presentation materials text.
  - 14. The apparatus according to claim 12, said data processing machine further comprising a link module for establishing a temporal link between said speech data and said reference text based on alignment of said target text and said reference text.
  - 15. The apparatus according to claim 12, said path penalty value comprising a sum of respective penalty values for each step of a path and each penalty value for each step of said path computed as follows:
    - said penalty value is 0 for same words;
      
      said penalty value of a substitution error corresponds to a pronunciation similarity of two words, said pronunciation similarity being based on corresponding phoneme similarity; and
      
      said penalty value of any one of an insertion error and a deletion error is constant.
  - 16. The apparatus according to claim 12, said phoneme similarity being predetermined.
  - 17. The apparatus according to claim 12, said phoneme similarity being measured by acoustic models distance of phonemes.
  - 18. The apparatus according to claim 17, said distance comprising one of Euclidean distance, Mahalanobis Distance, and Bhattacharyya distance.
  - 19. The apparatus according to claim 12, said data processing machine further comprising a paragraph alignment module aligning said target text and reference text at a paragraph level based on perplexity before said aligning of said target text and said reference text at said word level.
  - 20. The apparatus according to claim 19, said paragraph alignment module comprising the following for performing said aligning of said target text and said reference text at said paragraph level:
    - a language model module for establishing a language model for each paragraph in said reference text;
      
      a computing perplexity module computing perplexity scores for possible mappings for each sentence to each paragraph in said target text based on said language model; and
      
      a mapping module selecting a mapping result with a low perplexity score to map each sentence in target text to a paragraph in said target text.
  - 21. The apparatus according to claim 20, said paragraph alignment module further comprising a smoothing module smoothing said mapping result.
  - 22. The apparatus according to claim 21, said data processing machine further comprising an anchor determining module performing successive word string matching between said target text and said reference text to determine anchors in order to segment said target text and said reference text into smaller segments.
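
The paragraph-level alignment of claims 19 to 21 (and the corresponding method claims 8 to 10) can be illustrated with a small sketch. Everything specific here is an assumption for the example: the claims do not say which kind of language model is used, so this sketch uses an add-one-smoothed unigram model per reference paragraph, picks the lowest perplexity for each sentence, and smooths the result by overwriting an isolated label that disagrees with two agreeing neighbours.

```python
import math
from collections import Counter
from typing import Callable, List


def build_unigram_model(paragraph: List[str], vocab_size: int) -> Callable[[str], float]:
    """Add-one-smoothed unigram model for one reference paragraph (assumed model type)."""
    counts = Counter(paragraph)
    total = len(paragraph)

    def prob(word: str) -> float:
        return (counts[word] + 1) / (total + vocab_size)

    return prob


def perplexity(sentence: List[str], prob: Callable[[str], float]) -> float:
    """Perplexity of a non-empty tokenised sentence under a unigram model."""
    log_sum = sum(math.log(prob(w)) for w in sentence)
    return math.exp(-log_sum / len(sentence))


def map_sentences(sentences: List[List[str]], paragraphs: List[List[str]]) -> List[int]:
    """Map each target sentence to the index of the paragraph with the lowest perplexity."""
    vocab = {w for p in paragraphs for w in p} | {w for s in sentences for w in s}
    models = [build_unigram_model(p, len(vocab)) for p in paragraphs]
    mapping = []
    for sentence in sentences:
        scores = [perplexity(sentence, model) for model in models]
        mapping.append(min(range(len(models)), key=scores.__getitem__))
    return mapping


def smooth(mapping: List[int]) -> List[int]:
    """Replace a label that differs from two identical neighbours (assumed smoothing rule)."""
    out = list(mapping)
    for i in range(1, len(mapping) - 1):
        if mapping[i - 1] == mapping[i + 1] != mapping[i]:
            out[i] = mapping[i - 1]
    return out
```

The smoothing step reflects the intuition that consecutive recognized sentences usually belong to the same or adjacent paragraphs, so a single outlier in the mapping is more likely a recognition error than a real jump.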

23. A method for archiving a multimedia resource, said method comprising:
- acquiring an original multimedia resource and a reference text, said original multimedia resource comprising speech data and said reference text being associated with said speech data;
  
  recognizing said speech data to generate a target text, said recognizing comprising performing a speech recognition process on said speech data;
  
  aligning said target text and said reference text at a word level based on phoneme similarity between phonemes in words in said target text and corresponding phonemes in corresponding words in said reference text, said aligning comprising;
  
  parsing said phonemes of said words in said target text and said corresponding phonemes of said corresponding words in said reference text; and
  
  computing a path penalty value by using a dynamic time warping DTW algorithm with said phoneme similarity, and finding a best path matching said target text and said reference text with said path penalty value, said aligning of said target text and said reference text being based on said best path;
  
  establishing a temporal link between said speech data and said reference text based on alignment of said target text and said reference text; and
  
  adding said temporal link to said original multimedia resource to generate a new multimedia resource archive file, said acquiring, said recognizing, said aligning, said establishing, and said adding being performed by a programmed data processing machine.
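
The last two steps of claim 23 can be sketched as follows. The sketch assumes that the speech recognizer reports a start and end time for each recognized word, which the claim itself does not state, and that the word alignment is available as index pairs. The `TimedWord` type, the `build_temporal_links` and `make_archive` functions, and the dictionary layout of the archive are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TimedWord:
    """A recognized word with assumed recognizer timestamps, in seconds."""
    word: str
    start: float
    end: float


Span = Optional[Tuple[float, float]]


def build_temporal_links(
    recognized: List[TimedWord],
    reference: List[str],
    alignment: List[Tuple[Optional[int], Optional[int]]],
) -> List[Span]:
    """Copy timestamps from aligned target words onto reference words.

    `alignment` holds (target index, reference index) pairs; None marks a
    word with no partner. Unaligned reference words keep a None link.
    """
    links: List[Span] = [None] * len(reference)
    for t_idx, r_idx in alignment:
        if t_idx is not None and r_idx is not None:
            timed = recognized[t_idx]
            links[r_idx] = (timed.start, timed.end)
    return links


def make_archive(media_path: str, reference: List[str], links: List[Span]) -> Dict:
    """Bundle the media location with the time-linked reference text (hypothetical layout)."""
    return {
        "media": media_path,
        "reference": [{"word": w, "time": t} for w, t in zip(reference, links)],
    }
```

A reference word that the recognizer missed keeps a `None` link here; a fuller implementation might interpolate a time from its neighbours, but that is outside what the claim specifies.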

24. A method for searching a multimedia resource, said multimedia resource comprising speech data and reference text associated with said speech data and said method comprising:
- acquiring a key word for search;
  
  acquiring a reference text and a target text, said target text comprising recognized text acquired by performing speech recognition on speech data from a multimedia resource, said reference text being associated with said speech data, said reference text and said target text being aligned at a word level based on phoneme similarity between phonemes in words in said target text and corresponding phonemes in corresponding words in said reference text, wherein alignment of said reference text and said target text having been performed by;
  
  parsing said phonemes of said words in said target text and said corresponding phonemes of said corresponding words in said reference text;
  
  computing a path penalty value by using a dynamic time warping DTW algorithm with said phoneme similarity, and finding a best path matching said target text and said reference text with said path penalty value; and
  
  aligning said target text and said reference text being based on said best path, and said reference text and said speech data further having an established temporal link based on the alignment;
  
  searching for and identifying a location of said key word in said reference text; and
  
  locating a part of said multimedia resource corresponding to said key word in based on said location of said key word in said reference text and based on said temporal link;
  
  said acquiring of said key word, said acquiring of said multimedia resource, said searching and identifying, and said locating being performed by a programmed data processing machine.
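
The search and locate steps of claim 24 reduce to a lookup once the temporal link exists. The sketch below assumes the link is stored as one optional `(start, end)` time span per reference word, and that matching is a case-insensitive exact comparison; both are illustrative assumptions, and `locate_keyword` is a hypothetical name.

```python
from typing import List, Optional, Tuple

Span = Optional[Tuple[float, float]]


def locate_keyword(
    keyword: str, reference: List[str], links: List[Span]
) -> List[Tuple[float, float]]:
    """Return the media time spans of every linked reference word equal to the keyword."""
    hits: List[Tuple[float, float]] = []
    for word, span in zip(reference, links):
        # Search the reference text, then follow the temporal link to the media.
        if word.lower() == keyword.lower() and span is not None:
            hits.append(span)
    return hits
```

Searching the reference text rather than the recognized text is the point of the arrangement: the reference is assumed to be free of recognition errors, so a keyword that the recognizer got wrong can still be found and mapped to a position in the media.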

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Qin, Yong; Shi, Qin; Shuang, Zhiwei; Zhang, Shi Lei; Zhou, Jie
Primary Examiner(s): JACKSON, JAKIEDA R
Application Number: US12/869,921
Publication Number: US 20110054901A1
Time in Patent Office: 1,103 Days
Field of Search: 704/254
US Class Current: 704/254
CPC Class Codes: G06F 40/45 Example-based machine trans...
