Chinese speech recognition system and method

US 9,190,051 B2
Filed: 04/13/2012
Issued: 11/17/2015
Est. Priority Date: 05/10/2011
Status: Active Grant

Abstract

A Chinese speech recognition system and method are disclosed. First, a speech signal is received and recognized to output a word lattice. Next, the word arcs of the word lattice are rescored and reranked with a prosodic break model, a prosodic state model, a syllable prosodic-acoustic model, a syllable-juncture prosodic-acoustic model and a factored language model, so as to output a language tag, a prosodic tag and a phonetic segmentation tag corresponding to the speech signal. The invention performs rescoring in two stages to improve the recognition rate of basic speech information, and labels the language tag, prosodic tag and phonetic segmentation tag to provide prosodic structure and language information for downstream voice conversion and voice synthesis.
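The rescoring and reranking step described in the abstract can be illustrated with a minimal sketch. It assumes that each word arc already carries one log score per model and that the scores are combined as a weighted sum; the class `WordArc`, the function `rescore`, the model names used as dictionary keys and the weights are all hypothetical and are not taken from the patent, whose own rescoring equation is not reproduced in this record.

```python
from dataclasses import dataclass


@dataclass
class WordArc:
    """One word arc of a lattice (hypothetical structure for illustration)."""
    word: str
    start_node: int
    end_node: int
    scores: dict  # log score per model name, e.g. {"acoustic": -10.0}


def rescore(arcs, weights):
    """Return (total_score, arc) pairs sorted best-first.

    The total is a weighted sum of per-model log scores. A model that is
    missing from an arc's score table contributes 0.0.
    """
    ranked = []
    for arc in arcs:
        total = sum(w * arc.scores.get(name, 0.0) for name, w in weights.items())
        ranked.append((total, arc))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# Two competing arcs over the same span of the lattice (made-up scores).
arcs = [
    WordArc("word_a", 0, 1, {"acoustic": -10.0, "factored_lm": -4.0, "prosodic_break": -2.0}),
    WordArc("word_b", 0, 1, {"acoustic": -9.0, "factored_lm": -8.0, "prosodic_break": -6.0}),
]

first_pass = rescore(arcs, {"acoustic": 1.0})
second_pass = rescore(arcs, {"acoustic": 1.0, "factored_lm": 0.5, "prosodic_break": 0.5})
```

With these made-up numbers, the acoustic score alone prefers `word_b`, while adding the language and prosodic scores moves `word_a` to the top, which is the kind of reordering that reranking is meant to produce.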

24 Claims

1. A Chinese speech recognition system comprising:
- a language model storage device containing a plurality of language models, including a factored language model;
  
  a hierarchical prosodic model comprising a plurality of prosodic models, including a prosodic break model, a prosodic state model, a syllable prosodic-acoustic model and a syllable-juncture prosodic-acoustic model;
  
  a speech recognition device receiving a speech signal, recognizing said speech signal and outputting a word lattice; and
  
  a rescorer connected with said language model storage device, said hierarchical prosodic model and said speech recognition device, receiving said word lattice, rescoring and reranking word arcs of said word lattice according to said prosodic break model, said prosodic state model, said syllable prosodic-acoustic model and said syllable juncture prosodic-acoustic model, and outputting a language tag, a prosodic tag and a phonetic segmentation tag corresponding to said speech signal.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
- - 2. The Chinese speech recognition system according to claim 1, wherein said hierarchical prosodic model further comprises:
    - a prosody-unlabeled database storing a plurality of speech files and a plurality of texts of said speech files;
      
      a parameter extractor connected with said prosody-unlabeled database, extracting and outputting a plurality of low-level language parameters, a plurality of high-level language parameters, a syllable pitch-related prosodic-acoustic parameter, a syllable duration-related prosodic-acoustic parameter, and a syllable energy-related prosodic-acoustic parameter according to said speech files and said texts of said speech files;
      
      a Chinese prosody-hierarchy structure provider providing a plurality of prosodic components and a plurality of prosodic break tags separating said prosodic components; and
      
      a joint prosody labeling and modeling processor connected with said parameter extractor and said Chinese prosody-hierarchy structure provider, acquiring said low-level language parameters, said high-level language parameters, said syllable pitch-related prosodic-acoustic parameter, said syllable duration-related prosodic-acoustic parameter, and said syllable energy-related prosodic-acoustic parameter to estimate a prosodic state sequence P and a prosodic break sequence B, training said low-level language parameters, said high-level language parameters, a prosodic-acoustic parameter sequence X_P, said prosodic state sequence P and said prosodic break sequence B as said prosodic break model, said prosodic state model, said syllable prosodic-acoustic model and said syllable juncture prosodic-acoustic model to output them, and automatically tagging said prosodic state sequence P and said prosodic break sequence B on said speech signal.
  - 3. The Chinese speech recognition system according to claim 2, wherein said prosodic components includes syllables, prosodic words, prosodic phrases, and either of a breath group and a prosodic phrase group.
  - 4. The Chinese speech recognition system according to claim 2, wherein said joint prosody labeling and modeling processor estimates said prosodic state sequence P and said prosodic break sequence B according to a maximum likelihood criterion.
  - 5. The Chinese speech recognition system according to claim 2, wherein said joint prosody labeling and modeling processor trains low-level language parameters, said high-level language parameters, said prosodic-acoustic parameter sequence X_P, said prosodic state sequence P and said prosodic break sequence B as said prosodic break model, said prosodic state model, said syllable prosodic-acoustic model and said syllable-juncture prosodic-acoustic model according to a sequential optimization algorithm.
  - 6. The Chinese speech recognition system according to claim 2, wherein said factored language model is expressed by an equation:
  - 7. The Chinese speech recognition system according to claim 2, wherein said prosodic break model is expressed by
  - 8. The Chinese speech recognition system according to claim 2, wherein said prosodic state model is expressed by
  - 9. The Chinese speech recognition system according to claim 2, wherein said syllable prosodic-acoustic model is expressed by
  - 10. The Chinese speech recognition system according to claim 2, wherein said syllable juncture prosodic-acoustic model is expressed by
  - 11. The Chinese speech recognition system according to claim 1, wherein said speech recognition device contains an acoustic model and a bigram language model and uses said acoustic model and said bigram language model to recognize said speech signal for outputting said word lattice.
  - 12. The Chinese speech recognition system according to claim 11, wherein said rescorer performs rescoring according to an equation:
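Claims 11 and 12 describe a first pass that outputs a word lattice and a rescorer that assigns new scores to its word arcs. Once every arc has a single combined log score, the top hypothesis is the highest-scoring path through the lattice. The sketch below shows one common way to find it, under the assumption that the lattice is a directed acyclic graph whose node ids increase along every arc; the function `best_path` and the tuple layout are hypothetical and not part of the patent.

```python
def best_path(arcs, start_node, end_node):
    """Find the highest-scoring word sequence through a lattice.

    arcs: list of (from_node, to_node, word, log_score) tuples.
    Assumes from_node < to_node for every arc, so visiting arcs in order
    of from_node guarantees that a node's best score is final before any
    arc leaving that node is processed.
    Returns (total_log_score, words), or None if end_node is unreachable.
    """
    best = {start_node: (0.0, [])}
    for frm, to, word, score in sorted(arcs, key=lambda a: a[0]):
        if frm not in best:
            continue  # node not reachable from the start node
        candidate = best[frm][0] + score
        if to not in best or candidate > best[to][0]:
            best[to] = (candidate, best[frm][1] + [word])
    return best.get(end_node)


# Toy lattice: the two-arc route A, B competes with the single arc AB.
lattice = [
    (0, 1, "A", -1.0),
    (0, 2, "AB", -5.0),
    (1, 2, "B", -1.5),
    (2, 3, "C", -1.0),
]
result = best_path(lattice, 0, 3)
```

In this toy lattice the route through `A` and `B` scores -2.5 at node 2 against -5.0 for `AB`, so the returned word sequence is `A`, `B`, `C`.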

13. A Chinese speech recognition method comprising steps:
- receiving a speech signal, recognizing said speech signal and outputting a word lattice by a speech recognition device; and
  
  receiving said word lattice, rescoring word arcs of said word lattice according to a prosodic break model, a prosodic state model, a syllable prosodic acoustic model, a syllable-juncture prosodic-acoustic model and a factored language model stored in a language model storage device, reranking said word arcs, and outputting a language tag, a prosodic tag and a phonetic segmentation tag by a rescorer.
- Dependent Claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24)
- - 14. The Chinese speech recognition method according to claim 13, wherein said prosodic break model, said prosodic state model and said syllable prosodic-acoustic model, said syllable juncture prosodic-acoustic model are generated according to steps:
    - extracting a plurality of low-level language parameters, a plurality of high-level language parameters, syllable pitch-related prosodic-acoustic parameter, a syllable duration-related prosodic-acoustic parameter, and a syllable energy-related prosodic-acoustic parameter according to a plurality of speech files and a plurality of texts of said speech files, and outputting said low-level language parameters, said high-level language parameters, said syllable pitch-related prosodic-acoustic parameter, said syllable duration-related prosodic-acoustic parameter, and said syllable energy-related prosodic-acoustic parameter by a hierarchical prosodic model;
      
      acquiring said low-level language parameters, said high-level language parameters, said syllable pitch-related prosodic-acoustic parameter, said syllable duration-related prosodic-acoustic parameter, and said syllable energy-related prosodic-acoustic parameter to estimate, a prosodic state sequence P, and a prosodic break sequence B by the hierarchical prosodic model; and
      
      training a prosodic acoustic parameter sequence X_P, said prosodic state sequence P, and said prosodic break sequence B as said prosodic break model, said prosodic state model, said syllable prosodic-acoustic model and said syllable juncture prosodic-acoustic model, outputting said prosodic break model, said prosodic state model, said syllable prosodic-acoustic model and said syllable juncture prosodic-acoustic model, and automatically tagging said prosodic state sequence P and said prosodic break sequence B on said speech signal by the hierarchical prosodic model.
  - 15. The Chinese speech recognition method according to claim 14, wherein said prosodic components includes syllables, prosodic words, prosodic phrases, and either of a breath group and a prosodic phrase group.
  - 16. The Chinese speech recognition method according to claim 14, wherein said prosodic state sequence P and said prosodic break sequence B are estimated according to a maximum likelihood criterion.
  - 17. The Chinese speech recognition method according to claim 14, wherein said prosodic acoustic parameter sequence X_P, said prosodic state sequence P and said prosodic break sequence B are trained as said prosodic break model, said prosodic state model, said syllable prosodic-acoustic model and said syllable juncture prosodic-acoustic model according to a sequential optimization algorithm.
  - 18. The Chinese speech recognition method according to claim 14, wherein said factored language model is expressed by an equation:
  - 19. The Chinese speech recognition method according to claim 14, wherein said prosodic break model is expressed by
  - 20. The Chinese speech recognition method according to claim 14, wherein said prosodic state model is expressed by
  - 21. The Chinese speech recognition method according to claim 14, wherein said syllable prosodic-acoustic model is expressed by
  - 22. The Chinese speech recognition method according to claim 14, wherein said syllable juncture prosodic-acoustic model is expressed by
  - 23. The Chinese speech recognition method according to claim 13, wherein an acoustic model and a bigram language model are used to recognize said speech signal.
  - 24. The Chinese speech recognition method according to claim 23, wherein in said step of rescoring word arcs of said word lattice, an equation is used:
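Claims 14, 16 and 17 describe estimating the prosodic tags under a maximum likelihood criterion and training the models with a sequential optimization algorithm. The general idea of alternating between labeling and re-estimation can be shown with a deliberately simplified toy. It assumes a single hypothetical scalar feature per syllable juncture (here called a pause duration) and one equal-variance Gaussian per break class, so the maximum-likelihood label is simply the class with the nearest mean. The function `joint_label_and_train` and this feature choice are assumptions made for illustration, not the patent's procedure or models.

```python
def joint_label_and_train(pauses, means, iterations=10):
    """Alternate between labeling junctures and re-estimating class means.

    pauses: one scalar observation per syllable juncture (toy feature).
    means: initial mean of the observation for each break class.
    Returns (labels, means) after convergence or `iterations` rounds.
    """
    labels = []
    for _ in range(iterations):
        # Labeling step: under equal-variance Gaussians, the most likely
        # class for an observation is the one whose mean is closest.
        labels = [
            min(range(len(means)), key=lambda k: (p - means[k]) ** 2)
            for p in pauses
        ]
        # Training step: re-estimate each class mean from its members.
        new_means = list(means)
        for k in range(len(means)):
            members = [p for p, lab in zip(pauses, labels) if lab == k]
            if members:
                new_means[k] = sum(members) / len(members)
        if new_means == means:
            break  # parameters stopped changing, so labels are stable
        means = new_means
    return labels, means


# Made-up observations: three short and two long junctures.
pauses = [0.01, 0.02, 0.30, 0.35, 0.02]
labels, means = joint_label_and_train(pauses, [0.0, 0.5])
```

Each round can only keep or improve the fit of the labels to the current parameters, which is why this kind of alternation settles after a few iterations on small inputs like this one.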

Current Assignee: National Chiao Tung University (Government of The Republic of China)
Original Assignee: National Chiao Tung University (Government of The Republic of China)
Inventors: Chiang, Chen-Yu; Liu, Ming-Chieh; Liao, Yuan-Fu; Chen, Sin-Horng; Yang, Jyh-Her; Wang, Yih-Ru
Primary Examiner(s): ADESANYA, OLUJIMI A
Application Number: US13/446,663
Publication Number: US 20120290302A1
Time in Patent Office: 1,313 Days
Field of Search: 704/9
US Class Current: 1/1
CPC Class Codes:
G10L 15/063   Training
G10L 15/08   Speech classification or se...
G10L 15/1807   using prosody or stress
