Synchronization of an input text of a speech with a recording of the speech
First Claim
1. A method for synchronizing words in an input text of a speech with a continuous recording of the speech, said method implemented by execution of instructions by a processor of a computer system, said instructions being stored on computer readable storage media of the computer system, said method comprising:
generating a first dictionary stored in a first dictionary database of the computer system, said first dictionary comprising the words in the input text and associated first pronunciation speech data;
receiving input speech data encompassing the speech and being structured as a waveform obtained from the continuous recording of the speech spoken by a speaker reading the speech;
performing a first speech recognition of the input speech data, by comparing the input speech data with the first pronunciation speech data in the first dictionary, to generate a first recognition text comprising recognized words of the input text;
determining, by the processor of the computer system, from comparing the input text with the first recognition text, first erroneous recognition text comprising words of the input text erroneously recognized during performing the first speech recognition and not matching respective words of the first recognition text;
performing a second speech recognition of a first portion of the input speech data, corresponding to the first erroneous recognition text, to generate a second recognition text comprising recognized words of the first portion of the input speech data;
determining, by the processor of the computer system, from comparing the second recognition text with the first erroneous recognition text, second erroneous recognition text comprising words of the first erroneous recognition text differing from the words of second recognition text;
generating synthetic speech data corresponding to the second erroneous recognition text;
determining a second portion of the input speech data to which each word of the synthetic speech data corresponds;
computing, from the second portion of the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the second erroneous recognition text to a pronunciation time in the input speech data of each other word of the second erroneous recognition text;
determining, by the processor of the computer system, through use of the computed ratio data, a first association between each word of the second erroneous recognition text and a time to reproduce each portion of the input speech data corresponding to said each word of the second erroneous recognition text; and
recording the first association in a recording medium of the computer system and/or displaying the first association on a display device of the computer system.
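The text-comparison step of the claim (finding the words of the input text that the first recognition pass got wrong) can be illustrated with a word-level diff. The following is a minimal sketch only, assuming both texts are already split into word lists; the function name `erroneous_spans` and the sample sentences are hypothetical and do not come from the patent, and the standard-library `difflib` matcher stands in for whatever comparison the patented method actually uses.

```python
from difflib import SequenceMatcher


def erroneous_spans(input_words, recognized_words):
    """Return (start, end) index ranges into input_words (end exclusive)
    whose words have no matching word in the recognizer output."""
    matcher = SequenceMatcher(a=input_words, b=recognized_words, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'replace' and 'delete' mark input words that were misrecognized
        # or dropped; 'insert' has no input words, so it is skipped.
        if tag in ("replace", "delete") and i2 > i1:
            spans.append((i1, i2))
    return spans


# Hypothetical example data: the recognizer misheard two adjacent words.
input_text = "the quick brown fox jumps over the lazy dog".split()
first_recognition = "the quick crown fax jumps over the lazy dog".split()

spans = erroneous_spans(input_text, first_recognition)
for start, end in spans:
    print(start, end, input_text[start:end])
```

Each returned span identifies a run of input words that would be passed to the second recognition pass, together with the matching portion of the recording.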
Abstract
A method and system for synchronizing words in an input text of a speech with a continuous recording of the speech. A received input text includes previously recorded content of the speech to be reproduced. A synthetic speech corresponding to the received input text is generated. Ratio data including a ratio between the respective pronunciation times of words included in the received text in the generated synthetic speech is computed. The ratio data is used to determine an association between erroneously recognized words of the received text and a time to reproduce each erroneously recognized word. The association is outputted in a recording medium and/or displayed on a display device.
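The ratio idea in the abstract can be sketched as a proportional split: if the start and end of an unrecognized stretch of the recording are known, each word in that stretch receives a share of the time equal to its share of the synthetic-speech duration. This is an illustrative sketch under stated assumptions, not the patented implementation: the function `allocate_times`, its parameters, and the sample durations are hypothetical, and the segment boundaries are assumed to be already known (for example from neighbouring, correctly recognized words).

```python
def allocate_times(words, synthetic_durations, segment_start, segment_end):
    """Assign each word a (start, end) time inside the recorded segment,
    proportional to its pronunciation time in the synthetic speech."""
    if len(words) != len(synthetic_durations):
        raise ValueError("one synthetic duration is required per word")
    total = sum(synthetic_durations)
    if total <= 0:
        raise ValueError("synthetic durations must sum to a positive value")

    span = segment_end - segment_start
    timings = []
    elapsed = 0.0  # synthetic time consumed by the preceding words
    for word, duration in zip(words, synthetic_durations):
        start = segment_start + span * elapsed / total
        elapsed += duration
        end = segment_start + span * elapsed / total
        timings.append((word, start, end))
    return timings


# Hypothetical values: two unrecognized words spoken between 2.0 s and 4.0 s
# of the recording; the synthesizer took 0.75 s and 0.25 s to say them.
timings = allocate_times(["brown", "fox"], [0.75, 0.25], 2.0, 4.0)
for word, start, end in timings:
    print(f"{word}: {start:.2f}s to {end:.2f}s")
```

Only the ratio between the synthetic durations matters here, so the synthetic voice does not need to speak at the same rate as the recorded speaker.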
24 Claims
1. A method for synchronizing words in an input text of a speech with a continuous recording of the speech, said method implemented by execution of instructions by a processor of a computer system, said instructions being stored on computer readable storage media of the computer system, said method comprising:
generating a first dictionary stored in a first dictionary database of the computer system, said first dictionary comprising the words in the input text and associated first pronunciation speech data;
receiving input speech data encompassing the speech and being structured as a waveform obtained from the continuous recording of the speech spoken by a speaker reading the speech;
performing a first speech recognition of the input speech data, by comparing the input speech data with the first pronunciation speech data in the first dictionary, to generate a first recognition text comprising recognized words of the input text;
determining, by the processor of the computer system, from comparing the input text with the first recognition text, first erroneous recognition text comprising words of the input text erroneously recognized during performing the first speech recognition and not matching respective words of the first recognition text;
performing a second speech recognition of a first portion of the input speech data, corresponding to the first erroneous recognition text, to generate a second recognition text comprising recognized words of the first portion of the input speech data;
determining, by the processor of the computer system, from comparing the second recognition text with the first erroneous recognition text, second erroneous recognition text comprising words of the first erroneous recognition text differing from the words of second recognition text;
generating synthetic speech data corresponding to the second erroneous recognition text;
determining a second portion of the input speech data to which each word of the synthetic speech data corresponds;
computing, from the second portion of the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the second erroneous recognition text to a pronunciation time in the input speech data of each other word of the second erroneous recognition text;
determining, by the processor of the computer system, through use of the computed ratio data, a first association between each word of the second erroneous recognition text and a time to reproduce each portion of the input speech data corresponding to said each word of the second erroneous recognition text; and
recording the first association in a recording medium of the computer system and/or displaying the first association on a display device of the computer system.
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A computer program product, comprising a computer readable storage device having a computer readable program code stored therein, said computer readable program code containing instructions that when executed by a processor of a computer system implement a method for synchronizing words in an input text of a speech with a continuous recording of the speech, said method comprising:
generating a first dictionary stored in a first dictionary database of the computer system, said first dictionary comprising the words in the input text and associated first pronunciation speech data;
receiving input speech data encompassing the speech and being structured as a waveform obtained from the continuous recording of the speech spoken by a speaker reading the speech;
performing a first speech recognition of the input speech data, by comparing the input speech data with the first pronunciation speech data in the first dictionary, to generate a first recognition text comprising recognized words of the input text;
determining, from comparing the input text with the first recognition text, first erroneous recognition text comprising words of the input text erroneously recognized during performing the first speech recognition and not matching respective words of the first recognition text;
performing a second speech recognition of a first portion of the input speech data, corresponding to the first erroneous recognition text, to generate a second recognition text comprising recognized words of the first portion of the input speech data;
determining, from comparing the second recognition text with the first erroneous recognition text, second erroneous recognition text comprising words of the first erroneous recognition text differing from the words of second recognition text;
generating synthetic speech data corresponding to the second erroneous recognition text;
determining a second portion of the input speech data to which each word of the synthetic speech data corresponds;
computing, from the second portion of the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the second erroneous recognition text to a pronunciation time in the input speech data of each other word of the second erroneous recognition text;
determining, through use of the computed ratio data, a first association between each word of the second erroneous recognition text and a time to reproduce each portion of the input speech data corresponding to said each word of the second erroneous recognition text; and
recording the first association in a recording medium of the computer system and/or displaying the first association on a display device of the computer system.
Dependent claims: 10, 11, 12, 13, 14, 15, 16
17. A computer system comprising a processor and a computer readable memory unit coupled to the processor, said memory unit containing instructions that when executed by the processor implement a method for synchronizing words in an input text of a speech with a continuous recording of the speech, said method comprising:
generating a first dictionary stored in a first dictionary database of the computer system, said first dictionary comprising the words in the input text and associated first pronunciation speech data;
receiving input speech data encompassing the speech and being structured as a waveform obtained from the continuous recording of the speech spoken by a speaker reading the speech;
performing a first speech recognition of the input speech data, by comparing the input speech data with the first pronunciation speech data in the first dictionary, to generate a first recognition text comprising recognized words of the input text;
determining, from comparing the input text with the first recognition text, first erroneous recognition text comprising words of the input text erroneously recognized during performing the first speech recognition and not matching respective words of the first recognition text;
performing a second speech recognition of a first portion of the input speech data, corresponding to the first erroneous recognition text, to generate a second recognition text comprising recognized words of the first portion of the input speech data;
determining, from comparing the second recognition text with the first erroneous recognition text, second erroneous recognition text comprising words of the first erroneous recognition text differing from the words of second recognition text;
generating synthetic speech data corresponding to the second erroneous recognition text;
determining a second portion of the input speech data to which each word of the synthetic speech data corresponds;
computing, from the second portion of the input speech data to which each word of the synthetic speech data corresponds, ratio data comprising a ratio of a pronunciation time in the input speech data of each word of the second erroneous recognition text to a pronunciation time in the input speech data of each other word of the second erroneous recognition text;
determining, through use of the computed ratio data, a first association between each word of the second erroneous recognition text and a time to reproduce each portion of the input speech data corresponding to said each word of the second erroneous recognition text; and
recording the first association in a recording medium of the computer system and/or displaying the first association on a display device of the computer system.
Dependent claims: 18, 19, 20, 21, 22, 23, 24
Specification