Synchronization of an input text of a speech with a recording of the speech

US 8,209,169 B2
Filed: 10/24/2011
Issued: 06/26/2012
Est. Priority Date: 06/28/2007
Status: Active Grant

Abstract

A method and system for synchronizing words in an input text of a speech with a continuous recording of the speech. A received input text includes previously recorded content of the speech to be reproduced. A synthetic speech corresponding to the received input text is generated. Ratio data including a ratio between the respective pronunciation times of words included in the received text in the generated synthetic speech is computed. The ratio data is used to determine an association between erroneously recognized words of the received text and a time to reproduce each erroneously recognized word. The association is outputted in a recording medium and/or displayed on a display device.
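
The ratio step described in the abstract can be illustrated with a short sketch. It assumes that the start and end times of the unrecognized span in the recording are already known, and that a speech synthesizer has reported a duration for each word of that span. The function name `allocate_times` and both inputs are hypothetical; this is one plausible reading of the abstract, not the patented implementation.

```python
def allocate_times(words, synth_durations, span_start, span_end):
    """Split the interval [span_start, span_end] among the words in
    proportion to their (assumed) synthetic-speech durations."""
    if len(words) != len(synth_durations):
        raise ValueError("one duration is required per word")
    total = sum(synth_durations)
    if total <= 0:
        raise ValueError("durations must sum to a positive value")

    span = span_end - span_start
    cursor = span_start
    association = []
    for word, duration in zip(words, synth_durations):
        # Each word receives the share of the span given by its ratio.
        length = span * duration / total
        association.append((word, cursor, cursor + length))
        cursor += length
    return association


if __name__ == "__main__":
    # Hypothetical durations in seconds, as a synthesizer might report them.
    for word, start, end in allocate_times(["a", "b"], [1.0, 3.0], 10.0, 14.0):
        print(word, start, end)
```

Only the ratios between the synthetic durations matter here, so a synthesizer that speaks faster or slower than the recorded speaker yields the same allocation.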

21 Claims

1. A method for synchronizing words in an input text of a speech with a recording of the speech, comprising:
- performing, by a processor of a computer system, speech recognition of input speech data representing the speech, by comparing the input speech data with pronunciation speech data associated with the input text, to generate a recognition text comprising recognized words of the input text;
  
  determining, by the processor of the computer system, by comparing the input text with the recognition text, an erroneous recognition text comprising words of the input text not matching respective words of the recognition text;
  
  generating, by the processor of the computer system, synthetic speech data corresponding to the erroneous recognition text;
  
  computing, by the processor of the computer system, from the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the erroneous recognition text to a pronunciation time in the input speech data of each other word of the erroneous recognition text; and
  
  determining, by the processor of the computer system, based on the computed ratio data, an association between each word of the erroneous recognition text and a time to reproduce the input speech data corresponding to said each word of the erroneous recognition text.
- - 2. The method of claim 1, further comprising:
    - generating a first dictionary stored in a first dictionary database, said first dictionary comprising the words in the input text and the associated pronunciation speech data.
  - 3. The method of claim 2, wherein said generating the first dictionary comprises:
    - providing a basic dictionary stored in a basic dictionary database of the computer system, said basic dictionary comprising words and pronunciation data associated with each word of the basic dictionary for speaking each word of the basic dictionary;
      
      comparing the words in the input text with the words in the basic dictionary to determine words in the input text matching same words in the basic dictionary;
      
      for each matched word in the input text matching a same word in the basic dictionary, inserting the same word and the associated pronunciation data in the first dictionary, wherein the words in the first dictionary comprise each inserted same word, wherein the first pronunciation speech data in the first dictionary comprises the inserted associated pronunciation data.
  - 4. The method of claim 3, wherein said generating the first dictionary further comprises:
    - for each unmatched word in the input text not matching any word in the basic dictionary, generating associated synthetic speech data and inserting each unmatched word with its associated synthetic speech data in the first dictionary, wherein the words in the first dictionary further comprise each inserted unmatched word, wherein the first pronunciation speech data in the first dictionary further comprises the inserted associated synthetic speech data.
  - 5. The method of claim 1, further comprising:
    - generating, from analysis of the input speech data, time stamp data comprising a starting time and an ending time in the input speech data at which the pronunciation speech data for each word of the recognition text was spoken by the speaker; and
      
      generating, through use of the generated time stamp data, the input speech data corresponding to the erroneous recognition text.
  - 6. The method of claim 1, further comprising:
    - recording the association in a recording medium and/or displaying the association on a display device.
  - 7. The method of claim 1, further comprising:
    - performing speech recognition of the input speech data corresponding to the erroneous recognition text to generate a second recognition text, determining, by comparing the second recognition text with the erroneous recognition text, a second erroneous recognition text, and computing the ratio data based at least in part on the second erroneous recognition text.
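
The dictionary construction recited in claims 2 through 4 can be sketched as follows. The sketch assumes the basic dictionary is a mapping from word to pronunciation data and that a `synthesize` callable supplies synthetic speech data for words the basic dictionary lacks. The names `build_first_dictionary` and `synthesize`, and the use of plain strings as pronunciation data, are illustrative assumptions.

```python
def build_first_dictionary(input_words, basic_dictionary, synthesize):
    """Build a text-specific dictionary for the words of the input text.

    Words found in basic_dictionary reuse its pronunciation data; all
    other words get data from the (hypothetical) synthesize callable.
    """
    first_dictionary = {}
    for word in input_words:
        if word in first_dictionary:
            continue  # each distinct word is inserted once
        if word in basic_dictionary:
            first_dictionary[word] = basic_dictionary[word]
        else:
            first_dictionary[word] = synthesize(word)
    return first_dictionary


if __name__ == "__main__":
    basic = {"hello": "HH AH L OW"}              # assumed pronunciation format
    fake_synth = lambda word: "synth:" + word    # stand-in for a real synthesizer
    print(build_first_dictionary(["hello", "ibm", "hello"], basic, fake_synth))
```

Restricting the recognizer's vocabulary to the words of the input text is what lets the recognition step compare the recording against pronunciation data "associated with the input text".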

8. A computer program product, comprising a computer-readable storage device having a computer-readable program code stored therein, said computer-readable program code containing instructions that, when executed by a processor of a computer system, implement a method for synchronizing words in an input text of a speech with a recording of the speech, said method comprising:
- performing speech recognition of input speech data representing the speech, by comparing the input speech data with pronunciation speech data associated with the input text, to generate a recognition text comprising recognized words of the input text;
  
  determining, by comparing the input text with the recognition text, an erroneous recognition text comprising words of the input text not matching respective words of the recognition text;
  
  generating synthetic speech data corresponding to the erroneous recognition text;
  
  computing, from the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the erroneous recognition text to a pronunciation time in the input speech data of each other word of the erroneous recognition text; and
  
  determining, based on the computed ratio data, an association between each word of the erroneous recognition text and a time to reproduce the input speech data corresponding to said each word of the erroneous recognition text.
- - 9. The computer program product of claim 8, further comprising:
    - generating a first dictionary stored in a first dictionary database, said first dictionary comprising the words in the input text and the associated pronunciation speech data.
  - 10. The computer program product of claim 9, wherein said generating the first dictionary comprises:
    - providing a basic dictionary stored in a basic dictionary database of the computer system, said basic dictionary comprising words and pronunciation data associated with each word of the basic dictionary for speaking each word of the basic dictionary;
      
      comparing the words in the input text with the words in the basic dictionary to determine words in the input text matching same words in the basic dictionary;
      
      for each matched word in the input text matching a same word in the basic dictionary, inserting the same word and the associated pronunciation data in the first dictionary, wherein the words in the first dictionary comprise each inserted same word, wherein the first pronunciation speech data in the first dictionary comprises the inserted associated pronunciation data.
  - 11. The computer program product of claim 10, wherein generating the first dictionary further comprises:
    - for each unmatched word in the input text not matching any word in the basic dictionary, generating associated synthetic speech data and inserting each unmatched word with its associated synthetic speech data in the first dictionary, wherein the words in the first dictionary further comprise each inserted unmatched word, wherein the first pronunciation speech data in the first dictionary further comprises the inserted associated synthetic speech data.
  - 12. The computer program product of claim 8, further comprising:
    - generating, from analysis of the input speech data, time stamp data comprising a starting time and an ending time in the input speech data at which the pronunciation speech data for each word of the recognition text was spoken by the speaker; and
      
      generating, through use of the generated time stamp data, the input speech data corresponding to the erroneous recognition text.
  - 13. The computer program product of claim 8, further comprising:
    - recording the association in a recording medium and/or displaying the association on a display device.

14. The computer program product of claim 8, further comprising:
- performing speech recognition of the input speech data corresponding to the erroneous recognition text to generate a second recognition text, determining, by comparing the second recognition text with the erroneous recognition text, a second erroneous recognition text, and computing the ratio data based at least in part on the second erroneous recognition text.
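
The comparison of the input text with the recognition text, which the independent claims use to obtain the erroneous recognition text, could be approximated with a word-level diff. The sketch below uses Python's standard `difflib.SequenceMatcher`; the function name `find_erroneous_spans` and the choice of a diff-based alignment are assumptions, since the claims do not name a comparison algorithm.

```python
from difflib import SequenceMatcher


def find_erroneous_spans(input_words, recognized_words):
    """Return (start, end, words) for each run of input-text words that
    has no matching run in the recognition text."""
    matcher = SequenceMatcher(a=input_words, b=recognized_words, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace": misrecognized words; "delete": words the recognizer missed.
        if tag in ("replace", "delete"):
            spans.append((i1, i2, input_words[i1:i2]))
    return spans


if __name__ == "__main__":
    text = "the quick brown fox jumps".split()
    recognized = "the quack crown fox jumps".split()
    print(find_erroneous_spans(text, recognized))
```

Applying the same comparison again to the output of a second recognition pass over only the erroneous span would correspond to the refinement recited in claims 7, 14, and 21.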

15. A computer system comprising:
- a processor and a computer-readable memory unit coupled to the processor, said memory unit containing instructions that, when executed by the processor, implement a method for synchronizing words in an input text of a speech with a recording of the speech, said method comprising:
  
  performing speech recognition of input speech data representing the speech, by comparing the input speech data with pronunciation speech data associated with the input text, to generate a recognition text comprising recognized words of the input text;
  
  determining, by comparing the input text with the recognition text, an erroneous recognition text comprising words of the input text not matching respective words of the recognition text;
  
  generating synthetic speech data corresponding to the erroneous recognition text;
  
  computing, from the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the erroneous recognition text to a pronunciation time in the input speech data of each other word of the erroneous recognition text; and
  
  determining, based on the computed ratio data, an association between each word of the erroneous recognition text and a time to reproduce the input speech data corresponding to said each word of the erroneous recognition text.
- - 16. The computer system of claim 15, further comprising:
    - generating a first dictionary stored in a first dictionary database, said first dictionary comprising the words in the input text and the associated pronunciation speech data.
  - 17. The computer system of claim 16, wherein said generating the first dictionary comprises:
    - providing a basic dictionary stored in a basic dictionary database of the computer system, said basic dictionary comprising words and pronunciation data associated with each word of the basic dictionary for speaking each word of the basic dictionary;
      
      comparing the words in the input text with the words in the basic dictionary to determine words in the input text matching same words in the basic dictionary;
      
      for each matched word in the input text matching a same word in the basic dictionary, inserting the same word and the associated pronunciation data in the first dictionary, wherein the words in the first dictionary comprise each inserted same word, wherein the first pronunciation speech data in the first dictionary comprises the inserted associated pronunciation data.
  - 18. The computer system of claim 17, wherein said generating the first dictionary further comprises:
    - for each unmatched word in the input text not matching any word in the basic dictionary, generating associated synthetic speech data and inserting each unmatched word with its associated synthetic speech data in the first dictionary, wherein the words in the first dictionary further comprise each inserted unmatched word, wherein the first pronunciation speech data in the first dictionary further comprises the inserted associated synthetic speech data.
  - 19. The computer system of claim 15, further comprising:
    - generating, from analysis of the input speech data, time stamp data comprising a starting time and an ending time in the input speech data at which the pronunciation speech data for each word of the recognition text was spoken by the speaker; and
      
      generating, through use of the generated time stamp data, the input speech data corresponding to the erroneous recognition text.
  - 20. The computer system of claim 15, further comprising:
    - recording the association in a recording medium and/or displaying the association on a display device.
  - 21. The computer system of claim 15, further comprising:
    - performing speech recognition of the input speech data corresponding to the erroneous recognition text to generate a second recognition text, determining, by comparing the second recognition text with the erroneous recognition text, a second erroneous recognition text, and computing the ratio data based at least in part on the second erroneous recognition text.
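
The time stamp step of claims 5, 12, and 19 can be sketched under a simple assumption: the portion of the recording that corresponds to an erroneous span lies between the end time of the last correctly recognized word before it and the start time of the first correctly recognized word after it. The function names, the `(start, end)` tuple layout, and the use of a plain list of samples are all hypothetical.

```python
def erroneous_span_bounds(timestamps, j1, j2, total_duration):
    """Return (start, end) in seconds of the audio for recognition-text
    words j1..j2-1, using the time stamps of the neighboring words.

    timestamps: list of (start_time, end_time) per recognized word.
    """
    start = timestamps[j1 - 1][1] if j1 > 0 else 0.0
    end = timestamps[j2][0] if j2 < len(timestamps) else total_duration
    return start, end


def slice_samples(samples, sample_rate, start, end):
    """Cut the samples that fall between start and end (in seconds)."""
    first = int(round(start * sample_rate))
    last = int(round(end * sample_rate))
    return samples[first:last]


if __name__ == "__main__":
    stamps = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.6), (1.6, 2.0)]
    start, end = erroneous_span_bounds(stamps, 1, 3, total_duration=2.0)
    print(start, end, len(slice_samples(list(range(20)), 10, start, end)))
```

The resulting interval is the span over which word-level times can then be distributed according to the ratio data.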

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Imoto, Noriko; Uda, Tetsuya; Watanabe, Takatoshi
Primary Examiner(s): Vo, Huyen X.
Application Number: US13/279,479
Publication Number: US 20120041758A1
Time in Patent Office: 246 Days
Field of Search: 704/231, 704/235, 704/260, 704/258, 704/246-255, 704/270, 704/270.1
US Class Current: 704/231
CPC Class Codes:
G10L 13/00 Speech synthesis; Text to s...
G10L 15/26 Speech to text systems G10L...
