Synchronization of an input text of a speech with a recording of the speech

US 8,065,142 B2
Filed: 06/25/2008
Issued: 11/22/2011
Est. Priority Date: 06/28/2007
Status: Active Grant

Abstract

A method and system for synchronizing words in an input text of a speech with a continuous recording of the speech. A received input text includes previously recorded content of the speech to be reproduced. A synthetic speech corresponding to the received input text is generated. Ratio data including a ratio between the respective pronunciation times of words included in the received text in the generated synthetic speech is computed. The ratio data is used to determine an association between erroneously recognized words of the received text and a time to reproduce each erroneously recognized word. The association is outputted in a recording medium and/or displayed on a display device.
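
As a rough illustration of the ratio step described in the abstract, the sketch below divides a stretch of the recording among words in proportion to how long each word lasts in synthetic speech. The function name `allocate_times`, its parameters, and the use of seconds as the time unit are assumptions made for this example; they do not come from the patent.

```python
from typing import List, Tuple


def allocate_times(
    words: List[str],
    synthetic_durations: List[float],
    span_start: float,
    span_end: float,
) -> List[Tuple[str, float, float]]:
    """Split [span_start, span_end] among words by synthetic-duration ratio.

    Hypothetical helper: each word receives a share of the real recording
    equal to its share of the total synthetic pronunciation time.
    """
    if len(words) != len(synthetic_durations):
        raise ValueError("words and synthetic_durations must have equal length")
    total = sum(synthetic_durations)
    if total <= 0:
        raise ValueError("total synthetic duration must be positive")

    span = span_end - span_start
    result = []
    cursor = span_start
    for word, duration in zip(words, synthetic_durations):
        length = span * duration / total  # this word's share of the span
        result.append((word, cursor, cursor + length))
        cursor += length
    return result
```

For example, two words with synthetic durations of 1 s and 3 s sharing a 4 s stretch of the recording would receive 1 s and 3 s respectively.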

24 Claims

1. A method for synchronizing words in an input text of a speech with a continuous recording of the speech, said method implemented by execution of instructions by a processor of a computer system, said instructions being stored on computer readable storage media of the computer system, said method comprising:
- generating a first dictionary stored in a first dictionary database of the computer system, said first dictionary comprising the words in the input text and associated first pronunciation speech data;
  
  receiving input speech data encompassing the speech and being structured as a waveform obtained from the continuous recording of the speech spoken by a speaker reading the speech;
  
  performing a first speech recognition of the input speech data, by comparing the input speech data with the first pronunciation speech data in the first dictionary, to generate a first recognition text comprising recognized words of the input text;
  
  determining, by the processor of the computer system, from comparing the input text with the first recognition text, first erroneous recognition text comprising words of the input text erroneously recognized during performing the first speech recognition and not matching respective words of the first recognition text;
  
  performing a second speech recognition of a first portion of the input speech data, corresponding to the first erroneous recognition text, to generate a second recognition text comprising recognized words of the first portion of the input speech data;
  
  determining, by the processor of the computer system, from comparing the second recognition text with the first erroneous recognition text, second erroneous recognition text comprising words of the first erroneous recognition text differing from the words of second recognition text;
  
  generating synthetic speech data corresponding to the second erroneous recognition text;
  
  determining a second portion of the input speech data to which each word of the synthetic speech data corresponds;
  
  computing, from the second portion of the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the second erroneous recognition text to a pronunciation time in the input speech data of each other word of the second erroneous recognition text;
  
  determining, by the processor of the computer system, through use of the computed ratio data, a first association between each word of the second erroneous recognition text and a time to reproduce each portion of the input speech data corresponding to said each word of the second erroneous recognition text; and
  
  recording the first association in a recording medium of the computer system and/or displaying the first association on a display device of the computer system.
  - 2. The method of claim 1, wherein said generating the first dictionary comprises:
    - providing a basic dictionary stored in a basic dictionary database of the computer system, said basic dictionary comprising words and pronunciation data associated with each word of the basic dictionary for speaking each word of the basic dictionary;
      
      comparing the words in the input text with the words in the basic dictionary to determine words in the input text matching same words in the basic dictionary;
      
      for each matched word in the input text matching a same word in the basic dictionary, inserting the same word and the associated pronunciation data in the first dictionary, wherein the words comprised by the first dictionary comprise each inserted same word, wherein the first pronunciation speech data in the first dictionary comprises the inserted associated pronunciation data.
  - 3. The method of claim 2, wherein said generating the first dictionary further comprises:
    - for each unmatched word in the input text not matching any word in the basic dictionary, generating associated synthetic speech data and inserting each unmatched word with its associated synthetic speech data in the first dictionary, wherein the words comprised by the first dictionary further comprise each inserted unmatched word, wherein the first pronunciation speech data in the first dictionary further comprises the inserted associated synthetic speech data.
  - 4. The method of claim 1, wherein the method further comprises:
    - generating, from analysis of the input speech data, first time stamp data comprising a starting time and an ending time in the input speech data at which the first pronunciation speech data for each word of the first recognition text was spoken by the speaker; and
      
      prior to performing the second speech recognition, generating, through use of the generated first time stamp data, the first portion of the input speech data corresponding to the first erroneous recognition text.
  - 5. The method of claim 1, wherein the method further comprises generating a second dictionary stored in a second dictionary database of the computer system, said generating the second dictionary comprising:
    - inserting words of the first erroneous recognition text into the second dictionary and further comprising reading the first dictionary from the first dictionary database;
      
      removing from the read first dictionary at least one word included in the input text but not included in the first erroneous recognition text; and
      
      inserting the removed at least one word in the second dictionary, wherein said performing the second speech recognition comprises utilizing the inserted words in the second dictionary in conjunction with the first pronunciation speech data associated with each inserted word in the second dictionary.
  - 6. The method of claim 1, wherein said generating synthetic speech data comprises:
    - matching the words of the second erroneous recognition text with same words in a speech synthesis dictionary stored in a database of the computer system; and
      
      associating second pronunciation speech data in the speech synthesis dictionary corresponding to the same words in the speech synthesis dictionary with the matched words of the second erroneous recognition text.
  - 7. The method of claim 1, wherein the method further comprises:
    - generating, from analysis of the recognized words of the first portion of the input speech data, second time stamp data comprising a starting time and an ending time in the first portion of the input speech data at which each word of the second recognition text was spoken by the speaker; and
      
      recording the second time stamp data in the recording medium of the computer system.
  - 8. The method of claim 1, wherein the method further comprises:
    - determining, through use of the first recognition text, the second recognition text, the computed ratio data, and the first association, a second association between each word of the input speech data and a starting time and/or an ending time of each word of the input speech data; and
      
      recording the second association in the recording medium of the computer system and/or displaying the second association on the display device of the computer system.
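
One way to picture the comparison in claim 1 that yields the first erroneous recognition text is a word-level diff between the input text and the first recognition text. The claims do not name a comparison algorithm, so this sketch substitutes Python's `difflib.SequenceMatcher`; the helper name `erroneous_spans` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def erroneous_spans(
    input_words: List[str], recognized_words: List[str]
) -> List[List[str]]:
    """Return runs of input-text words with no match in the recognized text.

    Assumption: a word counts as erroneously recognized when the diff marks
    it as replaced or deleted relative to the recognizer output.
    """
    matcher = SequenceMatcher(a=input_words, b=recognized_words, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            spans.append(input_words[i1:i2])
    return spans
```

Each returned run is a candidate for the second recognition pass over the matching part of the recording.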

9. A computer program product, comprising a computer readable storage device having a computer readable program code stored therein, said computer readable program code containing instructions that when executed by a processor of a computer system implement a method for synchronizing words in an input text of a speech with a continuous recording of the speech, said method comprising:
- generating a first dictionary stored in a first dictionary database of the computer system, said first dictionary comprising the words in the input text and associated first pronunciation speech data;
  
  receiving input speech data encompassing the speech and being structured as a waveform obtained from the continuous recording of the speech spoken by a speaker reading the speech;
  
  performing a first speech recognition of the input speech data, by comparing the input speech data with the first pronunciation speech data in the first dictionary, to generate a first recognition text comprising recognized words of the input text;
  
  determining, from comparing the input text with the first recognition text, first erroneous recognition text comprising words of the input text erroneously recognized during performing the first speech recognition and not matching respective words of the first recognition text;
  
  performing a second speech recognition of a first portion of the input speech data, corresponding to the first erroneous recognition text, to generate a second recognition text comprising recognized words of the first portion of the input speech data;
  
  determining, from comparing the second recognition text with the first erroneous recognition text, second erroneous recognition text comprising words of the first erroneous recognition text differing from the words of second recognition text;
  
  generating synthetic speech data corresponding to the second erroneous recognition text;
  
  determining a second portion of the input speech data to which each word of the synthetic speech data corresponds;
  
  computing, from the second portion of the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the second erroneous recognition text to a pronunciation time in the input speech data of each other word of the second erroneous recognition text;
  
  determining, through use of the computed ratio data, a first association between each word of the second erroneous recognition text and a time to reproduce each portion of the input speech data corresponding to said each word of the second erroneous recognition text; and
  
  recording the first association in a recording medium of the computer system and/or displaying the first association on a display device of the computer system.
  - 10. The computer program product of claim 9, wherein said generating the first dictionary comprises:
    - providing a basic dictionary stored in a basic dictionary database of the computer system, said basic dictionary comprising words and pronunciation data associated with each word of the basic dictionary for speaking each word of the basic dictionary;
      
      comparing the words in the input text with the words in the basic dictionary to determine words in the input text matching same words in the basic dictionary;
      
      for each matched word in the input text matching a same word in the basic dictionary, inserting the same word and the associated pronunciation data in the first dictionary, wherein the words comprised by the first dictionary comprise each inserted same word, wherein the first pronunciation speech data in the first dictionary comprises the inserted associated pronunciation data.
  - 11. The computer program product of claim 10, wherein said generating the first dictionary further comprises:
    - for each unmatched word in the input text not matching any word in the basic dictionary, generating associated synthetic speech data and inserting each unmatched word with its associated synthetic speech data in the first dictionary, wherein the words comprised by the first dictionary further comprise each inserted unmatched word, wherein the first pronunciation speech data in the first dictionary further comprises the inserted associated synthetic speech data.
  - 12. The computer program product of claim 9, wherein the method further comprises:
    - generating, from analysis of the input speech data, first time stamp data comprising a starting time and an ending time in the input speech data at which the first pronunciation speech data for each word of the first recognition text was spoken by the speaker; and
      
      prior to performing the second speech recognition, generating, through use of the generated first time stamp data, the first portion of the input speech data corresponding to the first erroneous recognition text.
  - 13. The computer program product of claim 9, wherein the method further comprises generating a second dictionary stored in a second dictionary database of the computer system, said generating the second dictionary comprising:
    - inserting words of the first erroneous recognition text into the second dictionary and further comprising reading the first dictionary from the first dictionary database;
      
      removing from the read first dictionary at least one word included in the input text but not included in the first erroneous recognition text; and
      
      inserting the removed at least one word in the second dictionary, wherein said performing the second speech recognition comprises utilizing the inserted words in the second dictionary in conjunction with the first pronunciation speech data associated with each inserted word in the second dictionary.
  - 14. The computer program product of claim 9, wherein said generating synthetic speech data comprises:
    - matching the words of the second erroneous recognition text with same words in a speech synthesis dictionary stored in a database of the computer system; and
      
      associating second pronunciation speech data in the speech synthesis dictionary corresponding to the same words in the speech synthesis dictionary with the matched words of the second erroneous recognition text.
  - 15. The computer program product of claim 9, wherein the method further comprises:
    - generating, from analysis of the recognized words of the first portion of the input speech data, second time stamp data comprising a starting time and an ending time in the first portion of the input speech data at which each word of the second recognition text was spoken by the speaker; and
      
      recording the second time stamp data in the recording medium of the computer system.
  - 16. The computer program product of claim 9, wherein the method further comprises:
    - determining, through use of the first recognition text, the second recognition text, the computed ratio data, and the first association, a second association between each word of the input speech data and a starting time and/or an ending time of each word of the input speech data; and
      
      recording the second association in the recording medium of the computer system and/or displaying the second association on the display device of the computer system.
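
The dictionary construction in claims 10 and 11 (and their counterparts, claims 2 and 3) can be sketched as a lookup with a synthetic fallback. The names `build_first_dictionary` and `synthesize`, and the choice to represent pronunciation data as plain strings, are assumptions for illustration only; the claims do not fix a data format.

```python
from typing import Callable, Dict, Iterable


def build_first_dictionary(
    input_words: Iterable[str],
    basic_dictionary: Dict[str, str],
    synthesize: Callable[[str], str],
) -> Dict[str, str]:
    """Map each distinct input-text word to pronunciation data.

    Words found in the basic dictionary reuse its pronunciation data.
    Other words receive data from `synthesize`, a hypothetical stand-in
    for a speech synthesizer.
    """
    first_dictionary: Dict[str, str] = {}
    for word in input_words:
        if word in first_dictionary:
            continue  # already inserted
        if word in basic_dictionary:
            first_dictionary[word] = basic_dictionary[word]
        else:
            first_dictionary[word] = synthesize(word)
    return first_dictionary
```

Restricting the recognizer's vocabulary to words that actually occur in the input text is what distinguishes this first dictionary from the basic dictionary it draws on.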

17. A computer system comprising a processor and a computer readable memory unit coupled to the processor, said memory unit containing instructions that when executed by the processor implement a method for synchronizing words in an input text of a speech with a continuous recording of the speech, said method comprising:
- generating a first dictionary stored in a first dictionary database of the computer system, said first dictionary comprising the words in the input text and associated first pronunciation speech data;
  
  receiving input speech data encompassing the speech and being structured as a waveform obtained from the continuous recording of the speech spoken by a speaker reading the speech;
  
  performing a first speech recognition of the input speech data, by comparing the input speech data with the first pronunciation speech data in the first dictionary, to generate a first recognition text comprising recognized words of the input text;
  
  determining, from comparing the input text with the first recognition text, first erroneous recognition text comprising words of the input text erroneously recognized during performing the first speech recognition and not matching respective words of the first recognition text;
  
  performing a second speech recognition of a first portion of the input speech data, corresponding to the first erroneous recognition text, to generate a second recognition text comprising recognized words of the first portion of the input speech data;
  
  determining, from comparing the second recognition text with the first erroneous recognition text, second erroneous recognition text comprising words of the first erroneous recognition text differing from the words of second recognition text;
  
  generating synthetic speech data corresponding to the second erroneous recognition text;
  
  determining a second portion of the input speech data to which each word of the synthetic speech data corresponds;
  
  computing, from the second portion of the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the second erroneous recognition text to a pronunciation time in the input speech data of each other word of the second erroneous recognition text;
  
  determining, through use of the computed ratio data, a first association between each word of the second erroneous recognition text and a time to reproduce each portion of the input speech data corresponding to said each word of the second erroneous recognition text; and
  
  recording the first association in a recording medium of the computer system and/or displaying the first association on a display device of the computer system.
  - 18. The computer system of claim 17, wherein said generating the first dictionary comprises:
    - providing a basic dictionary stored in a basic dictionary database of the computer system, said basic dictionary comprising words and pronunciation data associated with each word of the basic dictionary for speaking each word of the basic dictionary;
      
      comparing the words in the input text with the words in the basic dictionary to determine words in the input text matching same words in the basic dictionary;
      
      for each matched word in the input text matching a same word in the basic dictionary, inserting the same word and the associated pronunciation data in the first dictionary, wherein the words comprised by the first dictionary comprise each inserted same word, wherein the first pronunciation speech data in the first dictionary comprises the inserted associated pronunciation data.
  - 19. The computer system of claim 18, wherein said generating the first dictionary further comprises:
    - for each unmatched word in the input text not matching any word in the basic dictionary, generating associated synthetic speech data and inserting each unmatched word with its associated synthetic speech data in the first dictionary, wherein the words comprised by the first dictionary further comprise each inserted unmatched word, wherein the first pronunciation speech data in the first dictionary further comprises the inserted associated synthetic speech data.
  - 20. The computer system of claim 17, wherein the method further comprises:
    - generating, from analysis of the input speech data, first time stamp data comprising a starting time and an ending time in the input speech data at which the first pronunciation speech data for each word of the first recognition text was spoken by the speaker; and
      
      prior to performing the second speech recognition, generating, through use of the generated first time stamp data, the first portion of the input speech data corresponding to the first erroneous recognition text.
  - 21. The computer system of claim 17, wherein the method further comprises generating a second dictionary stored in a second dictionary database of the computer system, said generating the second dictionary comprising:
    - inserting words of the first erroneous recognition text into the second dictionary and further comprising reading the first dictionary from the first dictionary database;
      
      removing from the read first dictionary at least one word included in the input text but not included in the first erroneous recognition text; and
      
      inserting the removed at least one word in the second dictionary, wherein said performing the second speech recognition comprises utilizing the inserted words in the second dictionary in conjunction with the first pronunciation speech data associated with each inserted word in the second dictionary.
  - 22. The computer system of claim 17, wherein said generating synthetic speech data comprises:
    - matching the words of the second erroneous recognition text with same words in a speech synthesis dictionary stored in a database of the computer system; and
      
      associating second pronunciation speech data in the speech synthesis dictionary corresponding to the same words in the speech synthesis dictionary with the matched words of the second erroneous recognition text.
  - 23. The computer system of claim 17, wherein the method further comprises:
    - generating, from analysis of the recognized words of the first portion of the input speech data, second time stamp data comprising a starting time and an ending time in the first portion of the input speech data at which each word of the second recognition text was spoken by the speaker; and
      
      recording the second time stamp data in the recording medium of the computer system.
  - 24. The computer system of claim 17, wherein the method further comprises:
    - determining, through use of the first recognition text, the second recognition text, the computed ratio data, and the first association, a second association between each word of the input speech data and a starting time and/or an ending time of each word of the input speech data; and
      
      recording the second association in the recording medium of the computer system and/or displaying the second association on the display device of the computer system.
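
Claim 20 (like claims 4 and 12) uses time stamps from the first recognition pass to cut out the portion of the recording that needs a second pass. The sketch below shows one possible reading: an erroneous stretch is assumed to run from the end of the last correctly recognized word to the start of the next one. That boundary rule, the `TimedWord` type, and the `error_intervals` function are all hypothetical and are not stated in the claims.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TimedWord:
    word: str
    start: float   # seconds from the beginning of the recording
    end: float
    correct: bool  # True if the word matched the input text


def error_intervals(
    timed_words: List[TimedWord], total_duration: float
) -> List[Tuple[float, float]]:
    """Return (start, end) intervals covering runs of misrecognized words."""
    intervals: List[Tuple[float, float]] = []
    span_start: Optional[float] = None
    prev_end = 0.0  # end time of the most recent correct word
    for tw in timed_words:
        if tw.correct:
            if span_start is not None:
                intervals.append((span_start, tw.start))
                span_start = None
            prev_end = tw.end
        elif span_start is None:
            span_start = prev_end
    if span_start is not None:
        # An erroneous run that reaches the end of the recording.
        intervals.append((span_start, total_duration))
    return intervals
```

Each interval can then be passed to the second speech recognition as its own segment of the waveform.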

Current Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Watanabe, Takatoshi; Imoto, Noriko; Uda, Tetsuya
Primary Examiner(s): Vo; Huyen X.
Application Number: US12/145,804
Publication Number: US 20090006087A1
Time in Patent Office: 1,245 Days
Field of Search: 704/231, 704/235, 704/254, 704/255, 704/258, 704/260, 704/270, 704/276, 704/246, 704/257, 704/270.1
US Class Current: 704/231
CPC Class Codes:
G10L 13/00 Speech synthesis; Text to s...
G10L 15/26 Speech to text systems G10L...
