Speech recognition-synthesis based encoding/decoding method, and speech encoding/decoding system

US 6,161,091 A
Filed: 03/17/1998
Issued: 12/12/2000
Est. Priority Date: 03/18/1997
Status: Expired due to Term

Abstract

A speech recognition synthesis based encoding/decoding method recognizes phonetic segments, syllables, words or the like as character information from an input speech signal and detects pitch periods, phoneme or syllable durations or the like, as information for prosody generation, from the input speech signal, transfers or stores the character information and information for prosody generation as code data, decodes the transferred or stored code data to acquire the character information and information for prosody generation, and synthesizes the acquired character information and information for prosody generation to obtain a speech signal.
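
As a rough illustration of the encode, transfer/store, and decode round trip summarized in the abstract, the sketch below packs recognized phonetic segments together with a pitch period and a duration into bytes and unpacks them again. The record layout (a 1-byte symbol followed by two unsigned 16-bit fields), the `Segment` type, the units, and the function names are all assumptions made for illustration; the patent does not specify this format.

```python
import struct
from typing import List, NamedTuple


class Segment(NamedTuple):
    symbol: str        # recognized phonetic segment (one ASCII character here)
    pitch_period: int  # pitch period in samples; 0 marks unvoiced (assumed convention)
    duration_ms: int   # segment duration in milliseconds (assumed unit)


# Assumed record layout: 1-byte symbol, then two big-endian unsigned 16-bit fields.
_RECORD = struct.Struct(">cHH")


def encode(segments: List[Segment]) -> bytes:
    """Pack character and prosody information into code data."""
    return b"".join(
        _RECORD.pack(s.symbol.encode("ascii"), s.pitch_period, s.duration_ms)
        for s in segments
    )


def decode(code_data: bytes) -> List[Segment]:
    """Recover character and prosody information from code data."""
    return [
        Segment(sym.decode("ascii"), pitch, dur)
        for sym, pitch, dur in _RECORD.iter_unpack(code_data)
    ]
```

With this hypothetical layout each segment costs 5 bytes, which hints at why transmitting symbols plus prosody can need far fewer bits than transmitting a waveform.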

26 Claims

1. A speech recognition synthesis based encoding/decoding method comprising the steps of:
- recognizing character information from an input speech signal;
  
  detecting first prosody information from said input speech signal;
  
  encoding said character information and said first prosody information to acquire code data;
  
  transferring or storing the code data;
  
  decoding said transferred or stored code data to said character information and said first prosody information;
  
  selecting a synthesis unit codebook from a plurality of synthesis unit codebooks in accordance with one of said first prosody information and a specified type of a synthesized speech, the plurality of synthesis unit codebooks storing second prosody information prepared from speech data of different speakers, the selecting step including computing error between the first prosody information and the second prosody information and selecting from said synthesis unit codebooks a synthesis unit codebook which minimizes the error; and
  
  synthesizing a speech signal using said character information and the selected said synthesis unit codebook.
  - 2. The speech recognition synthesis based encoding/decoding method according to claim 1, wherein said recognizing step includes dividing said input speech signal into analysis frames, acquiring a feature vector for each of the analysis frames, and computing a similarity between said feature vector for each of the analysis frames and a feature template vector previously prepared for each phonetic segment to determine a phonetic segment of each of the analysis frames which is used to recognize the character information.
  - 3. The speech recognition synthesis based encoding/decoding method according to claim 2, wherein said similarity computing step includes computing a Euclidean distance based on said feature vector and said feature template vector to determine a phonetic segment which minimizes said Euclidean distance as a phonetic segment of said synthesis frame.
  - 4. The speech recognition synthesis based encoding/decoding method according to claim 2, further comprising the steps of determining if said input speech signal is a voiced speech or a unvoiced speech and detecting a pitch period of said input speech signal when determined as a voiced speech, and detecting a duration of said phonetic segment recognized by said recognizing step.
  - 5. The speech recognition synthesis based encoding/decoding method according to claim 1, wherein said recognizing step includes dividing said input speech signal into analysis frames, acquiring a feature vector for each of the analysis frames, and computing an incidence of the feature vector relative to HMM (Hidden Markov Model) previously prepared for each phonetic segment to determine a phonetic segment of each of the analysis frames which is used to recognize the character information.
  - 6. The method according to claim 1, wherein said transferring/storing step includes the step of transferring or storing select information indicating the specified type of a synthesized speech.
  - 7. The method according to claim 6, which includes the step of altering intonation and voice properties of the synthesized speech in accordance with the select information.
  - 8. The method according to claim 1, wherein said selecting step includes the step of generating select information indicating the specified type of a synthesized speech to select the one of said synthesis unit codebooks in accordance with the select information.
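
The frame-wise recognition recited in claims 2 and 3 can be sketched as a nearest-template search: each analysis frame's feature vector is compared against one template vector per phonetic segment, and the segment with the smallest Euclidean distance is chosen. The function names, the dictionary of templates, and the two-dimensional toy features are assumptions for illustration; the claims do not fix the feature type or dimensionality.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_frames(
    features: List[Sequence[float]],
    templates: Dict[str, Sequence[float]],
) -> List[str]:
    """Label each analysis frame with the phonetic segment whose
    template vector is closest to the frame's feature vector."""
    return [
        min(templates, key=lambda seg: euclidean(frame, templates[seg]))
        for frame in features
    ]
```

For example, with hypothetical templates `{"a": [1.0, 0.0], "i": [0.0, 1.0]}`, the frames `[[0.9, 0.1], [0.2, 0.8]]` are labeled `["a", "i"]`.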

9. A speech recognition synthesis based encoding/decoding method comprising the steps of:
- recognizing phonetic segments, syllables or words as character information from an input speech signal;
  
  detecting pitch periods and durations of said phonetic segments or syllables, as first prosody information, from said input speech signal;
  
  encoding said character information and said first prosody information to obtain code data;
  
  transferring or storing said code data;
  
  decoding said transferred or stored code data to said character information and said first prosody information;
  
  selecting a synthesis unit codebook from a plurality of synthesis unit codebooks in accordance with one of said first prosody information and a specified type of a synthesized speech, the plurality of synthesis unit codebooks storing second prosody information prepared from speech data of different speakers, the selecting step including computing error between the first prosody information and the second prosody information and selecting from said synthesis unit codebooks a synthesis unit codebook which minimizes the error; and
  
  synthesizing a speech signal using said character information and the selected synthesis unit codebook.
  - 10. The speech recognition synthesis based encoding/decoding method according to claim 9, wherein said recognizing step includes dividing said input speech signal into analysis frames, acquiring a feature vector for each of the analysis frames, and computing a similarity between said feature vector for each of the analysis frames and a feature template vector previously prepared for each phonetic segment to determine a phonetic segment of said each synthesis frame which is used to recognize the character information.
  - 11. The speech recognition synthesis based encoding/decoding method according to claim 10, wherein said similarity computing step includes computing a Euclidean distance based on said feature vector and said feature template vector to determine a phonetic segment which minimizes said Euclidean distance as a phonetic segment of said analysis frames.
  - 12. The speech recognition synthesis based encoding/decoding method according to claim 10, further comprising the steps of determining if said input speech signal is a voiced speech or a unvoiced speech to detect a pitch period of said input speech signal when determined as a voiced speech, and detecting a duration of a phonetic segment recognized by said recognizing and detecting step.
  - 13. The speech recognition synthesis based encoding/decoding method according to claim 10, wherein said synthesizing step includes coupling spectral parameters corresponding to individual phonetic segments as a word or a sentence, processing an excitation signal based on a data stream including said phonetic segments, pitch periods and durations in accordance with said pitch period and said durations to generate an excitation signal for a synthesis filter, and processing said spectral parameters and said excitation signal in accordance with a speech synthesis model to produce a synthesized speech signal.
  - 14. The speech recognition synthesis based encoding/decoding method according to claim 9, wherein said recognizing step includes dividing said input speech signal into analysis frames, acquiring a feature vector for each of the analysis frames, and computing an incidence of the feature vector relative to HMM (Hidden Markov Model) previously prepared for each phonetic segment to determine a phonetic segment of each of the analysis frames which is used to recognize the character information.
  - 15. The method according to claim 9, wherein said transferring/storing step includes the step of transferring or storing select information indicating the specified type of a synthesized speech.
  - 16. The method according to claim 15, which includes the step of altering intonation and voice properties of the synthesized speech in accordance with the select information.
  - 17. The method according to claim 9, wherein said selecting step includes the step of generating select information indicating the specified type of a synthesized speech to select the one of said synthesis unit codebooks in accordance with the select information.
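
Claim 13 describes generating an excitation signal from pitch period and duration and passing it, with spectral parameters, through a synthesis filter. A minimal sketch follows, assuming a classic source-filter model: a unit pulse train for voiced segments, white noise for unvoiced ones, and an all-pole filter whose coefficients stand in for the spectral parameters. The claim itself does not name a specific synthesis model, so the all-pole choice and every name below are illustrative assumptions.

```python
import random
from typing import List, Sequence


def make_excitation(pitch_period: int, n_samples: int, seed: int = 0) -> List[float]:
    """Pulse train spaced by pitch_period for voiced speech;
    uniform white noise when pitch_period is 0 (assumed unvoiced marker)."""
    if pitch_period > 0:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def all_pole_filter(excitation: Sequence[float], coeffs: Sequence[float]) -> List[float]:
    """Apply y[n] = x[n] - sum_k coeffs[k-1] * y[n-k] (assumed synthesis filter)."""
    out: List[float] = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out
```

The number of samples passed to `make_excitation` would follow from the decoded segment duration, so duration and pitch period together shape the excitation as the claim describes.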

18. A speech encoding/decoding system comprising:
- a recognition section configured to recognize character information from an input speech signal;
  
  a detection section configured to detect first prosody information from said input speech signal;
  
  an encoding section configured to encode said character information and said first prosody information to code data;
  
  a transfer/storage section configured to transfer or store said code data acquired by said encoding section;
  
  a decoding section configured to decode said transferred or stored code data to said character information and said first prosody information;
  
  a plurality of synthesis unit codebooks storing second prosody information prepared from speech data of different speakers;
  
  a controller configured to select one of said synthesis unit codebooks in accordance with one of said first prosody information and a specified type of a synthesized speech by computing error between the first prosody information and the second prosody information and selecting from said synthesis unit codebooks a synthesis unit codebook which minimizes the error; and
  
  a synthesis section configured to synthesize a speech signal using said character information and the selected one of said synthesis unit codebooks.
  - 19. The speech encoding/decoding system according to claim 18, wherein said recognition section includes an analysis frame generation section configured to divide said input speech signal into analysis frames, a feature extraction section configured to acquire a feature vector for each of the analysis frames, and a phonetic segment determination section configured to compute a similarity between said feature vector for each of the analysis frames and a feature template vector previously prepared for each phonetic segment to determine a phonetic segment of each of the analysis frames which is used to recognize the character information.
  - 20. The speech encoding/decoding system according to claim 19, wherein said phonetic segment determination section computes a Euclidean distance based on said feature vector and said feature template vector and determines a phonetic segment which minimizes said Euclidean distance as a phonetic segment of said analysis frames.
  - 21. The speech encoding/decoding system according to claim 19, wherein said detection section includes a pitch detector configured to determine if said input speech signal is a voiced speech or a unvoiced speech and detecting a pitch period of said input speech signal when determined as a voiced speech, and a duration detector configured to detect a duration of a phonetic segment recognized by said recognition section.
  - 22. The speech encoding/decoding system according to claim 18, wherein said recognition section includes an analysis frame generation section configured to divide said input speech signal into analysis frames, a feature extraction section configured to acquire a feature vector for each of the analysis frames, and a phonetic segment determination section configured to compute an incidence of the feature vector relative to HMM (Hidden Markov Model) previously prepared for each phonetic segment to determine a phonetic segment of each of the analysis frames.
  - 23. The system according to claim 18, wherein said transfer/storage section is configured to generate and transfer or store select information indicating the specified type of a synthesized speech.
  - 24. The system according to claim 23, which includes an altering section configured to alter intonation and voice properties of the synthesized speech in accordance with the select information.
  - 25. The system according to claim 18, wherein said controller is configured to generate and transfer or store select information indicating the specified type of a synthesized speech to select the one of said synthesis unit codebooks in accordance with the select information.
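
The pitch detector of claim 21 (and the matching method steps in claims 4 and 12) decides whether the input is voiced and, if so, detects its pitch period. The claims do not say how; the sketch below substitutes a common autocorrelation approach, treating a frame as voiced when its normalized autocorrelation peak exceeds a threshold. The threshold value, lag range, and function name are assumptions.

```python
from typing import List, Optional


def detect_pitch_period(
    frame: List[float],
    min_lag: int,
    max_lag: int,
    voicing_threshold: float = 0.5,
) -> Optional[int]:
    """Return the pitch period in samples, or None if the frame is judged unvoiced.

    The score for each lag is the autocorrelation divided by the frame energy.
    A frame counts as voiced only if some lag scores above voicing_threshold.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return None  # silence: no voicing decision possible
    best_lag: Optional[int] = None
    best_score = voicing_threshold
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        corr = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        score = corr / energy
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

Because the sum shrinks as the lag grows, this simple score slightly favors shorter lags, which helps avoid picking a multiple of the true period.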

26. A speech recognition synthesis based encoding method comprising the steps of:
- recognizing character information from an input speech signal;
  
  detecting prosody information from said input speech signal;
  
  generating select information indicating a type of a synthesized speech to be produced by a decoder based upon an error between the prosody information and stored prosody generation information;
  
  encoding said character information and said prosody information to acquire code data; and
  
  transferring or storing the code data and the select information.
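
Claim 26 has the encoder generate select information from an error between the detected prosody and stored prosody generation information, which parallels the minimum-error codebook selection in claims 1, 9, and 18. A minimal sketch follows, assuming the error is a sum of absolute differences, the select information is a one-byte index, and it is simply prepended to the code data. None of these choices come from the patent, and both function names are hypothetical.

```python
from typing import Dict, Sequence


def generate_select_info(
    prosody: Sequence[float],
    stored: Dict[int, Sequence[float]],
) -> int:
    """Return the index of the stored prosody entry with the smallest error."""
    def error(reference: Sequence[float]) -> float:
        if len(reference) != len(prosody):
            raise ValueError("prosody vectors must have the same length")
        # Assumed error measure: sum of absolute differences.
        return sum(abs(p - r) for p, r in zip(prosody, reference))

    return min(stored, key=lambda idx: error(stored[idx]))


def build_payload(code_data: bytes, select_info: int) -> bytes:
    """Prepend the select information as a single byte (assumed framing)."""
    if not 0 <= select_info <= 255:
        raise ValueError("select_info must fit in one byte")
    return bytes([select_info]) + code_data
```

A decoder following this framing would read the first byte to choose its synthesis unit codebook and treat the remaining bytes as code data.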

Current Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Original Assignee: Kabushiki Kaisha Toshiba (Toshiba Corporation)
Inventors: Koshiba, Ryosuke; Akamine, Masami
Primary Examiner(s): Smits, Talivaldis I.
Assistant Examiner(s): Chawan, Vijay B
Application Number: US09/042,612
Time in Patent Office: 1,001 Days
Field of Search: 704/214, 704/208, 704/260, 704/201, 704/270, 704/275, 704/256, 704/207, 704/258, 704/257, 704/264
US Class Current: 704/258
CPC Class Codes: G10L 19/0018 Speech coding using phoneti...
