System And Method For Supporting Text-To-Speech

US 20080046247A1
Filed: 07/09/2007
Published: 02/21/2008
Est. Priority Date: 08/21/2006
Status: Active Grant

Abstract

A system for generating high-quality synthesized text-to-speech includes a learning data generating unit, a frequency data generating unit, and a setting unit. The learning data generating unit recognizes inputted speech, and then generates first learning data in which wordings of phrases are associated with readings thereof. The frequency data generating unit generates, based on the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases. The setting unit sets the thus generated frequency data for a language processing unit in order to approximate outputted speech of text-to-speech to the inputted speech. Furthermore, the language processing unit generates, from a wording of text, a reading corresponding to the wording, on the basis of the appearance frequencies.
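The flow described in the abstract can be illustrated with a minimal Python sketch. It assumes speech recognition has already produced (wording, reading) pairs; the function names, the phoneme-style reading strings, and the sample data are hypothetical and do not come from the patent.

```python
from collections import Counter, defaultdict

def build_frequency_data(learning_data):
    """Count how often each reading appears for each wording."""
    freq = defaultdict(Counter)
    for wording, reading in learning_data:
        freq[wording][reading] += 1
    return freq

def choose_reading(freq, wording):
    """Return the most frequent reading for a wording, or None if unseen."""
    readings = freq.get(wording)
    if not readings:
        return None
    return readings.most_common(1)[0][0]

# Hypothetical first learning data: (wording, reading) pairs from recognized speech.
learning_data = [
    ("read", "R EH D"),
    ("read", "R EH D"),
    ("read", "R IY D"),
    ("lead", "L IY D"),
]

freq = build_frequency_data(learning_data)
best = choose_reading(freq, "read")
```

Here `build_frequency_data` stands in for the frequency data generating unit and `choose_reading` for the language processing unit after the frequency data has been set in it.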

35 Claims

1. A system for supporting text-to-speech, comprising:
- a learning data generating unit which recognizes inputted speech, and generates first learning data in which wordings of phrases are associated with readings thereof;
  
  a frequency data generating unit which generates, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings and readings of phrases;
  
  a language processing unit; and
  
  a setting unit which sets frequency data in the language processing unit for generating, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording in order to approximate outputted speech of text-to-speech to the inputted speech.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
- - 2. The system according to claim 1, further comprising a sound data generating unit, wherein:
    - the learning data generating unit recognizes speech, and generates second learning data in which readings of phrases are associated with waveform data of phonemes;
      
      the sound data generating unit generates sound data based on wordings of phrases and waveform data of phonemes in the second learning data;
      
      a sound processing unit; and
      
      the setting unit sets the sound data in the sound processing unit which outputs speech in accordance with readings of phrases in order to approximate the outputted speech of text-to-speech to the recognized speech.
  - 3. The system according to claim 1, wherein on the basis of the first learning data, the frequency data generating unit generates, as the frequency data, an appearance frequency at which each of combinations of readings corresponding to each of combinations of plural phrases continuously written appears.
  - 4. The system according to claim 3, further comprising an acquisition unit which acquires frequency data having been already set in the language processing unit, wherein:
    - the frequency data generating unit generates a frequency data candidate for each of combinations of plural phrases continuously written, by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data having been acquired by the acquisition unit, and an appearance frequency at which the combination of readings of the phrases appears in the first learning data;
      
      with respect to the entire readings which are generated by the language processing unit on the basis of a phrase wording in the first learning data, and which correspond to each of plural frequency data candidates generated by using different weights in taking the weighted averages, the frequency data generating unit computes a ratio of readings identical with those in the first learning data to the entire readings; and
      
      the frequency data generating unit generates, as new frequency data, the frequency data candidate making the identical reading ratio not less than a predetermined criterion.
  - 5. The system according to claim 3, wherein:
    - the learning data generating unit recognizes inputted speech, and generates first learning data in which a wording, a reading and a word class of each of phrases are associated with one another; and
      
      on the basis of the first learning data, the frequency data generating unit generates, as the frequency data, appearance frequencies of combinations of readings, and appearance frequencies of combinations of word classes in each of combinations of plural phrases continuously written.
  - 6. The system according to claim 1, wherein:
    - the learning data generating unit recognizes inputted speech, and generates first learning data in which a wording, a reading and a word class of each of phrases are associated with one another; and
      
      the frequency data generating unit generates, as the frequency data indicating appearance frequencies of readings of the phrases, frequency data indicating appearance frequencies of each of the combinations of pronunciations and accents of the phrases.
  - 7. The system according to claim 6, wherein:
    - the learning data generating unit divides the speech of each phrase into plural accent phrases, classifies an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents, and generates, as a reading of each phrase, a group of kinds of accents in each of the accent phrases; and
      
      the frequency data generating unit generates the frequency data by judging that an appearance frequency of the group of kinds of accents in each of the accent phrases of a phrase in the first learning data is an appearance frequency of a reading of the phrase in the first learning data.
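The candidate selection recited in claim 4 can be sketched as follows. This is a simplified illustration, not the patented implementation: it keys frequencies by single (wording, reading) pairs rather than by combinations of consecutive phrases, uses a trivial highest-frequency lookup as a stand-in for the language processing unit, and tries weights in the order given. All names and data are hypothetical.

```python
from collections import Counter

def relative_frequencies(learning_data):
    """Relative frequency of each reading per wording in the learning data."""
    counts = Counter(learning_data)
    per_wording = Counter(w for w, _ in learning_data)
    return {(w, r): n / per_wording[w] for (w, r), n in counts.items()}

def interpolate(existing, learned, weight):
    """Weighted average of two frequency tables; weight applies to the existing table."""
    keys = set(existing) | set(learned)
    return {
        k: weight * existing.get(k, 0.0) + (1.0 - weight) * learned.get(k, 0.0)
        for k in keys
    }

def predict(freq, wording):
    """Stand-in language processing unit: pick the highest-frequency reading."""
    candidates = {r: f for (w, r), f in freq.items() if w == wording}
    return max(candidates, key=candidates.get) if candidates else None

def match_ratio(freq, learning_data):
    """Share of generated readings identical to those in the learning data."""
    hits = sum(1 for w, r in learning_data if predict(freq, w) == r)
    return hits / len(learning_data)

def select_frequency_data(existing, learned, learning_data, weights, criterion):
    """Return the first (weight, candidate) whose match ratio meets the criterion."""
    for weight in weights:
        candidate = interpolate(existing, learned, weight)
        if match_ratio(candidate, learning_data) >= criterion:
            return weight, candidate
    return None, None

# Hypothetical data: the existing table favors one reading, the speaker another.
existing = {("read", "R IY D"): 0.9, ("read", "R EH D"): 0.1}
learning_data = [("read", "R EH D")] * 3 + [("read", "R IY D")]
learned = relative_frequencies(learning_data)

chosen_weight, new_freq = select_frequency_data(
    existing, learned, learning_data, weights=[0.9, 0.5, 0.1], criterion=0.7
)
```

Trying the largest weight on the existing data first is a design choice of this sketch: it keeps as much of the previously set frequency data as possible while still meeting the criterion. The claim itself does not specify an order.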

8. A system for supporting text-to-speech, comprising:
- a learning data generating unit which recognizes inputted speech, and generates first learning data in which wordings of phrases are associated with readings thereof;
  
  a language processing unit; and
  
  a learning unit which causes the language processing unit to learn on the basis of the first learning data, the language processing unit generating, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies in the first learning data in order to approximate outputted speech of text-to-speech to the inputted speech.
- View Dependent Claims (9, 10, 11, 12, 13, 14)
- - 9. The system according to claim 8, further comprising a sound data generating unit, wherein:
    - the learning data generating unit recognizes speech, and generates second learning data in which readings of phrases are associated with waveform data of phonemes;
      
      the sound data generating unit generates sound data based on wordings of phrases and waveform data of phonemes in the second learning data;
      
      a sound processing unit; and
      
      a setting unit sets the sound data in the sound processing unit which outputs speech in accordance with readings of phrases in order to approximate the outputted speech of text-to-speech to the recognized speech.
  - 10. The system according to claim 8, wherein on the basis of the first learning data, the frequency data generating unit generates, as the frequency data, an appearance frequency at which each of combinations of readings corresponding to each of combinations of plural phrases continuously written appears.
  - 11. The system according to claim 10, further comprising an acquisition unit which acquires frequency data having been already set in the language processing unit, wherein:
    - the frequency data generating unit generates a frequency data candidate for each of combinations of plural phrases continuously written, by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data having been acquired by the acquisition unit, and an appearance frequency at which the combination of readings of the phrases appears in the first learning data;
      
      with respect to the entire readings which are generated by the language processing unit on the basis of a phrase wording in the first learning data, and which correspond to each of plural frequency data candidates generated by using different weights in taking the weighted averages, the frequency data generating unit computes a ratio of readings identical with those in the first learning data to the entire readings; and
      
      the frequency data generating unit generates, as new frequency data, the frequency data candidate making the identical reading ratio not less than a predetermined criterion.
  - 12. The system according to claim 10, wherein:
    - the learning data generating unit recognizes inputted speech, and generates first learning data in which a wording, a reading, and a word class of each of phrases are associated with one another; and
      
      on the basis of the first learning data, the frequency data generating unit generates, as the frequency data, appearance frequencies of combinations of readings, and appearance frequencies of combinations of word classes in each of combinations of plural phrases continuously written.
  - 13. The system according to claim 8, wherein:
    - the learning data generating unit recognizes inputted speech, and generates first learning data in which a wording, a reading, and a word class of each of phrases are associated with one another; and
      
      the frequency data generating unit generates, as the frequency data indicating appearance frequencies of readings of the phrases, frequency data indicating appearance frequencies of each of the combinations of pronunciations and accents of the phrases.
  - 14. The system according to claim 13, wherein:
    - the learning data generating unit divides the speech of each phrase into plural accent phrases, classifies an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents, and generates, as a reading of each phrase, a group of kinds of accents in each of the accent phrases; and
      
      the frequency data generating unit generates the frequency data by judging that an appearance frequency of the group of kinds of accents in each of the accent phrases of a phrase in the first learning data is an appearance frequency of a reading of the phrase in the first learning data.
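The frequency of reading combinations over consecutively written phrases, as recited in claim 10, can be illustrated with a bigram count. This sketch assumes each recognized utterance is already segmented into (wording, reading) phrases and looks only at pairs of adjacent phrases; the names and data are hypothetical.

```python
from collections import Counter, defaultdict

def bigram_reading_frequencies(sentences):
    """Map each pair of adjacent wordings to a Counter of observed reading pairs."""
    freq = defaultdict(Counter)
    for phrases in sentences:
        # zip(phrases, phrases[1:]) yields every pair of consecutive phrases.
        for (w1, r1), (w2, r2) in zip(phrases, phrases[1:]):
            freq[(w1, w2)][(r1, r2)] += 1
    return freq

# Hypothetical first learning data: one list of (wording, reading) per utterance.
sentences = [
    [("I", "AY"), ("read", "R EH D"), ("it", "IH T")],
    [("I", "AY"), ("read", "R EH D"), ("books", "B UH K S")],
    [("to", "T UW"), ("read", "R IY D"), ("it", "IH T")],
]

bigram_freq = bigram_reading_frequencies(sentences)
```

Counting pairs rather than single phrases lets the neighboring word disambiguate a reading: in this sample, "read" after "to" only ever appears with one reading.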

15. A method of supporting text-to-speech, comprising the steps of:
- recognizing inputted speech, and generating first learning data in which wordings of phrases are associated with readings thereof;
  
  generating, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings, and readings of phrases; and
  
  setting frequency data in a language processing unit which generates, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording in order to approximate outputted speech of text-to-speech to the inputted speech.
- View Dependent Claims (16, 17, 18, 19, 20, 21)
- - 16. The method according to claim 15, further comprising the steps of:
    - recognizing speech, and generating second learning data in which readings of phrases are associated with waveform data of phonemes;
      
      generating sound data based on wordings of phrases and waveform data of phonemes in the second learning data; and
      
      setting the sound data in a sound processing unit which outputs speech in accordance with readings of phrases in order to approximate the outputted speech of text-to-speech to the recognized speech.
  - 17. The method according to claim 15, further comprising the step of generating as the frequency data, on the basis of the first learning data, an appearance frequency at which each of combinations of readings corresponding to each of combinations of plural phrases continuously written appears.
  - 18. The method according to claim 17, further comprising the steps of:
    - acquiring frequency data having been already set in the language processing unit;
      
      generating a frequency data candidate for each of combinations of plural phrases continuously written, by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data having been acquired in the step of acquiring, and an appearance frequency at which the combination of readings of the phrases appears in the first learning data;
      
      with respect to the entire readings which are generated by the language processing unit on the basis of a phrase wording in the first learning data, and which correspond to each of plural frequency data candidates generated by using different weights in taking the weighted averages, computing a ratio of readings identical with those in the first learning data to the entire readings; and
      
      generating, as new frequency data, the frequency data candidate making the identical reading ratio not less than a predetermined criterion.
  - 19. The method according to claim 17, wherein:
    - recognizing inputted speech and generating first learning data in which a wording, a reading and a word class of each of phrases are associated with one another; and
      
      on the basis of the first learning data, generating, as the frequency data, appearance frequencies of combinations of readings, and appearance frequencies of combinations of word classes in each of combinations of plural phrases continuously written.
  - 20. The method according to claim 15, further comprising the steps of:
    - recognizing inputted speech and generating first learning data in which a wording, a reading and a word class of each of phrases are associated with one another; and
      
      generating, as the frequency data indicating appearance frequencies of readings of the phrases, frequency data indicating appearance frequencies of each of the combinations of pronunciations and accents of the phrases.
  - 21. The method according to claim 20, further comprising the steps of:
    - dividing the speech of each phrase into plural accent phrases;
      
      classifying an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents;
      
      generating, as a reading of each phrase, a group of kinds of accents in each of the accent phrases; and
      
      generating the frequency data by judging that an appearance frequency of the group of kinds of accents in each of the accent phrases of a phrase in the first learning data is an appearance frequency of a reading of the phrase in the first learning data.
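The accent steps of claim 21 can be sketched under strong simplifying assumptions: the speech is taken to be already divided into accent phrases, each accent phrase is represented by a list of pitch values, and the three accent kinds and the classification rule below are invented for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical set of predetermined accent kinds.
ACCENT_KINDS = ("rising", "falling", "flat")

def classify_accent(pitch_contour, tolerance=5.0):
    """Classify one accent phrase by comparing its first and last pitch values."""
    delta = pitch_contour[-1] - pitch_contour[0]
    if delta > tolerance:
        return "rising"
    if delta < -tolerance:
        return "falling"
    return "flat"

def accent_reading(accent_phrases):
    """Reading of a phrase: the group (tuple) of accent kinds of its accent phrases."""
    return tuple(classify_accent(contour) for contour in accent_phrases)

def accent_frequencies(learning_data):
    """Count how often each group of accent kinds appears for each wording."""
    freq = defaultdict(Counter)
    for wording, accent_phrases in learning_data:
        freq[wording][accent_reading(accent_phrases)] += 1
    return freq

# Hypothetical data: (wording, list of pitch contours, one per accent phrase).
learning_data = [
    ("good morning", [[100.0, 120.0], [130.0, 110.0]]),
    ("good morning", [[105.0, 125.0], [128.0, 100.0]]),
    ("good morning", [[100.0, 102.0], [130.0, 110.0]]),
]

accent_freq = accent_frequencies(learning_data)
```

Treating the tuple of accent kinds as the reading means its count can be used directly as the appearance frequency of that reading, which is the judgment the final step of the claim describes.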

22. A program product for allowing an information processing apparatus to function as a system for supporting text-to-speech, the program product causing the information system to function as:
- a learning data generating unit which recognizes inputted speech, and generates first learning data in which wordings of phrases are associated with readings thereof;
  
  a frequency data generating unit which generates, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings, and readings of phrases; and
  
  a setting unit which, in order to approximate outputted speech of text-to-speech to the inputted speech, sets frequency data in a language processing unit for generating, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording.
- View Dependent Claims (23, 24, 25, 26, 27, 28)
- - 23. The program product according to claim 22, wherein the program product further causes the information system to function as a sound data generating unit, wherein:
    - the learning data generating unit recognizes speech, and generates second learning data in which readings of phrases are associated with waveform data of phonemes;
      
      the sound data generating unit generates sound data based on wordings of phrases and waveform data of phonemes in the second learning data; and
      
      the setting unit sets the sound data in a sound processing unit which outputs speech in accordance with readings of phrases in order to approximate the outputted speech of text-to-speech to the recognized speech.
  - 24. The program product according to claim 22, wherein the program product further causes the information system to function such that on the basis of the first learning data, the frequency data generating unit generates, as the frequency data, an appearance frequency at which each of combinations of readings corresponding to each of combinations of plural phrases continuously written appears.
  - 25. The program product according to claim 24, wherein the program product further causes the information system to function as an acquisition unit which acquires frequency data having been already set in the language processing unit, wherein:
    - the frequency data generating unit generates a frequency data candidate for each of combinations of plural phrases continuously written, by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data having been acquired by the acquisition unit, and an appearance frequency at which the combination of readings of the phrases appears in the first learning data;
      
      with respect to readings (called, the entire readings) which are generated by the language processing unit on the basis of a phrase wording in the first learning data, and which correspond to each of plural frequency data candidates generated by using different weights in taking the weighted averages, the frequency data generating unit computes a ratio of readings identical with those in the first learning data to the entire readings; and
      
      the frequency data generating unit generates, as new frequency data, the frequency data candidate making the identical reading ratio not less than a predetermined criterion.
  - 26. The program product according to claim 24, wherein the program product further causes the information system to function such that:
    - the learning data generating unit recognizes inputted speech, and generates first learning data in which a wording, a reading and a word class of each of phrases are associated with one another; and
      
      on the basis of the first learning data, the frequency data generating unit generates, as the frequency data, appearance frequencies of combinations of readings, and appearance frequencies of combinations of word classes in each of combinations of plural phrases continuously written.
  - 27. The program product according to claim 22, wherein the program product further causes the information system to function such that:
    - the learning data generating unit recognizes inputted speech, and generates first learning data in which a wording, a reading and a word class of each of phrases are associated with one another; and
      
      the frequency data generating unit generates, as the frequency data indicating appearance frequencies of readings of the phrases, frequency data indicating appearance frequencies of each of the combinations of pronunciations and accents of the phrases.
  - 28. The program product according to claim 27, wherein the program product further causes the information system to function such that:
    - the learning data generating unit divides the speech of each phrase into plural accent phrases, classifies an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents, and generates, as a reading of each phrase, a group of kinds of accents in each of the accent phrases; and
      
      the frequency data generating unit generates the frequency data by judging that an appearance frequency of the group of kinds of accents in each of the accent phrases of a phrase in the first learning data is an appearance frequency of a reading of the phrase in the first learning data.

29. An article of manufacture comprising a computer usable medium having computer readable program code means embodied therein for supporting text-to-speech, the computer readable program code means in said article of manufacture comprising:
- computer readable program code means for causing a computer to effect recognizing inputted speech and generating first learning data in which wordings of phrases are associated with readings thereof;
  
  computer readable program code means for causing a computer to effect generating, on the basis of the first learning data, frequency data indicating appearance frequencies of both wordings, and readings of phrases; and
  
  computer readable program code means for causing a computer to effect setting frequency data in a language processing unit which generates, from a wording of text, a reading corresponding to the wording, on the basis of appearance frequencies of readings corresponding to the wording in order to approximate outputted speech of text-to-speech to the inputted speech.
- View Dependent Claims (30, 31, 32, 33, 34, 35)
- - 30. The article of manufacture according to claim 29, further comprising computer readable program code means for causing a computer to effect:
    - recognizing speech, and generating second learning data in which readings of phrases are associated with waveform data of phonemes;
      
      generating sound data based on wordings of phrases and waveform data of phonemes in the second learning data; and
      
      setting the sound data in a sound processing unit which outputs speech in accordance with readings of phrases in order to approximate the outputted speech of text-to-speech to the recognized speech.
  - 31. The article of manufacture according to claim 29, further comprising computer readable program code means for causing a computer to effect generating as the frequency data, on the basis of the first learning data, an appearance frequency at which each of combinations of readings corresponding to each of combinations of plural phrases continuously written appears.
  - 32. The article of manufacture according to claim 31, further comprising computer readable program code means for causing a computer to effect:
    - acquiring frequency data having been already set in the language processing unit;
      
      generating a frequency data candidate for each of combinations of plural phrases continuously written, by taking a weighted average of an appearance frequency of each combination of readings of the phrases in the frequency data having been acquired in the step of acquiring, and an appearance frequency at which the combination of readings of the phrases appears in the first learning data;
      
      with respect to the entire readings which are generated by the language processing unit on the basis of a phrase wording in the first learning data, and which correspond to each of plural frequency data candidates generated by using different weights in taking the weighted averages, computing a ratio of readings identical with those in the first learning data to the entire readings; and
      
      generating, as new frequency data, the frequency data candidate making the identical reading ratio not less than a predetermined criterion.
  - 33. The article of manufacture according to claim 31, further comprising computer readable program code means for causing a computer to effect:
    - recognizing inputted speech and generating first learning data in which a wording, a reading and a word class of each of phrases are associated with one another; and
      
      on the basis of the first learning data, generating, as the frequency data, appearance frequencies of combinations of readings, and appearance frequencies of combinations of word classes in each of combinations of plural phrases continuously written.
  - 34. The article of manufacture according to claim 29, further comprising computer readable program code means for causing a computer to effect:
    - recognizing inputted speech and generating first learning data in which a wording, a reading and a word class of each of phrases are associated with one another; and
      
      generating, as the frequency data indicating appearance frequencies of readings of the phrases, frequency data indicating appearance frequencies of each of the combinations of pronunciations and accents of the phrases.
  - 35. The article of manufacture according to claim 34, further comprising computer readable program code means for causing a computer to effect:
    - dividing the speech of each phrase into plural accent phrases;
      
      classifying an accent of the speech in each of the accent phrases into any one of predetermined plural kinds of accents;
      
      generating, as a reading of each phrase, a group of kinds of accents in each of the accent phrases; and
      
      generating the frequency data by judging that an appearance frequency of the group of kinds of accents in each of the accent phrases of a phrase in the first learning data is an appearance frequency of a reading of the phrase in the first learning data.
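Several dependent claims (for example claims 5, 12, 19, 26 and 33) also count combinations of word classes over consecutively written phrases. A minimal sketch, assuming each phrase is a (wording, reading, word class) triple and only adjacent pairs are counted; the word-class labels and data are hypothetical.

```python
from collections import Counter

def word_class_bigram_frequencies(sentences):
    """Count pairs of word classes for adjacent phrases across all utterances."""
    counts = Counter()
    for phrases in sentences:
        classes = [word_class for _, _, word_class in phrases]
        counts.update(zip(classes, classes[1:]))
    return counts

# Hypothetical first learning data: (wording, reading, word class) per phrase.
sentences = [
    [("I", "AY", "pronoun"), ("read", "R EH D", "verb"), ("books", "B UH K S", "noun")],
    [("she", "SH IY", "pronoun"), ("reads", "R IY D Z", "verb"), ("it", "IH T", "pronoun")],
]

class_freq = word_class_bigram_frequencies(sentences)
```

Word-class counts of this kind could be kept alongside the reading-combination counts so that the language processing unit has both signals available when choosing a reading.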

Current Assignee
Cerence Operating Company (Cerence Inc.)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Nagano, Toru; Nishimura, Masafumi; Kurata, Gakuto; Tachibana, Ryuki

Granted Patent

US 7,921,014 B2
US Class Current

704/260
CPC Class Codes

G10L 13/04 Details of speech synthesis...

G10L 15/26 Speech to text systems G10L...
