Dialogue System Incorporating Unique Speech to Text Conversion Method for Meaningful Dialogue Response

US 20180308487A1
Filed: 04/21/2017
Published: 10/25/2018
Est. Priority Date: 04/21/2017
Status: Active Grant

Abstract

A real-time dialogue system provides real-time transcription of spoken text with a sub-second delay by keeping track of word timings and word accuracy. The system uses a grammar or a list of keywords, together with a statistical language model, to produce the transcripts. In addition, the system uses a deep neural network based I-vector system to continuously analyze the audio quality and to assess and identify additional metadata such as the gender, language, accent, age, emotion and identity of an end user in order to enhance the response. The invention provides a conversational dialogue system that robustly identifies certain specific user commands or intents while otherwise allowing a natural conversation, without switching between grammar-based and natural-language modes.
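
As a rough illustration of the abstract's final point, the sketch below checks each transcript against a small command grammar and, when nothing matches, passes the text through as open conversation, so the caller never switches modes. The rule patterns, the action tag names, and the `interpret` function are hypothetical and are not taken from the patent.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical command grammar: (regex pattern, action tag) pairs.
COMMAND_GRAMMAR = [
    (re.compile(r"\b(?:check|what is) my balance\b"), "GET_BALANCE"),
    (re.compile(r"\btransfer (?P<amount>\d+) dollars\b"), "TRANSFER_FUNDS"),
]


@dataclass
class Interpretation:
    transcript: str
    action_tag: Optional[str] = None  # None means "treat as open conversation"
    entities: dict = field(default_factory=dict)


def interpret(transcript: str) -> Interpretation:
    """Return a command interpretation if a grammar rule matches, else pass through."""
    text = transcript.lower()
    for pattern, tag in COMMAND_GRAMMAR:
        match = pattern.search(text)
        if match:
            # Named groups in the rule become the extracted entities.
            return Interpretation(transcript, tag, match.groupdict())
    return Interpretation(transcript)
```

Calling `interpret("Please transfer 50 dollars to savings")` yields the `TRANSFER_FUNDS` tag with an `amount` entity, while an utterance that matches no rule comes back with `action_tag` set to `None`.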

19 Claims

1. A system for providing real-time transcripts of spoken text, the system comprising:
- a speech to text engine for converting an input speech of an end user into a text input, wherein the text input comprises one or more sequences of recognized word strings or a word lattice in text form; and
  
  a semantic engine to receive the text input for producing one or more transcripts using a language model and extracting semantic meanings for said one or more transcripts;
  
  wherein the semantic engine utilizes a grammar model and the language model to extract a meaning for said one or more transcripts.
- - 2. The system of claim 1 further configured to (i) identify and store additional metadata about the end user selected from the group consisting of age, gender, language, accent, and emotional state and (ii) search and verify the end user's identity.
  - 3. The system of claim 1, wherein the semantic engine, on receiving the one or more sequences of recognized word strings or the word lattice, extracts semantic meaning from the one or more sequences of recognized word strings or the word lattice, and associates that with one or more action tags and entities known to the system.
  - 4. The system of claim 3, wherein the semantic engine further comprises a semantic mapper that is configured to label the one or more action tags by including multiple possible meanings of a word from the one or more sequences of recognized word strings in a particular context.
  - 5. The system of claim 1, further comprising:
    - a query generator configured to map the one or more action tags to one or more database queries present in an interactive workflow logic module, wherein the interactive workflow logic module handles situations arising subsequent to a dialogue response; and
      
      a natural language generator trained to receive the mapped one or more action tags and said additional data, wherein the mapped one or more action tags and said additional data are mapped into one or more logical sentences to form a spoken response of the dialogue system in real-time.
  - 6. The system of claim 1, wherein the system utilizes GRXML or JSGF or ABNF format grammars to (i) learn the one or more action tags and entities of the semantic engine, and wherein the system enhances a vocabulary based on the grammar model and a vocabulary based on the language model.
  - 7. The system of claim 1, further comprising a language/accent recognition subsystem that extracts acoustic features from the input speech of the end user to identify the language and/or accent of the end user, wherein said language/accent recognition subsystem comprises:
    - a speech activity detection module to detect speech activity;
      
      a shifted delta cepstral (SDC) module to compute cepstral mean and variance normalization of the input speech and to produce SDC feature vectors;
      
      an I-vector extractor module to receive SDC feature vectors and to produce I-vectors using a deep neural network-universal background model (DNN-UBM); and
      
      a logistic regression classifier module to receive and classify the I-vectors in order to identify the end user's language or accent.
  - 8. The system of claim 1 wherein the system further comprises a speaker recognition (SR) subsystem that extracts acoustic features from the input speech of the end user to identify and verify the end user, said speaker recognition subsystem comprises:
    - a speech activity detection module to detect speech activity of the end user;
      
      an MFCC computation module to calculate Mel Frequency Cepstral Coefficient along with cepstral mean and variance normalization of the speech activity and to generate feature vectors;
      
      a keyword spotter module to provide keyword spotting based enrollment and verification of the end user;
      
      a DNN-UBM based I-vector extractor module to produce I-vectors using a deep neural network-universal background model and a probabilistic linear discriminant analysis (PLDA) based classifier module to classify the identity of the end user.
  - 9. The system of claim 1, wherein the system further comprises an age and an emotion identification subsystem that detects the age and emotional state of the end user.
  - 10. The system of claim 9, wherein the age and emotion identification subsystem comprises a speech activity detection module to detect speech information and to generate an output for an MFCC computation module;
    - wherein said MFCC computation module performs analysis of the acoustic features followed by cepstral mean and variance normalization of the input speech to identify the age and emotion of the end user;
      
      a DNN-UBM based I-vector extractor to generate I-vector for the identified acoustic features; and
      
      a logistic regression classifier to classify the I-vectors to identify the end user's age and emotion.
  - 11. The system of claim 9, wherein the age and emotion identification subsystem comprises an end to end LSTM-RNN based DNN classifier;
    - said end to end LSTM-RNN based DNN classifier has two convolutional layers followed by two Network-in-Network (NIN) layers which perform the role of feature extraction from raw waveforms; and
      
      the end to end LSTM-RNN based DNN classifier has 2 LSTM layers after the feature extraction layers followed by a softmax layer.
  - 12. The system of claim 11, wherein the end to end LSTM-RNN based DNN classifier has no separate acoustic feature extraction module at a signal processing level and raw speech frames obtained from the end user's input speech waveform are directly presented to the input layer of the DNN.
  - 13. The system of claim 9, wherein the emotion identification system provides for both discrete and continuous classification of the end user's emotional level;
    - said discrete classification of the end user's emotion comprises classes like anger, happiness, anxiety, neutral, boredom and sadness; and
      
      the continuous classification of the end user's emotion provides a rating of emotional level on two continuous scales called valence and arousal.
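
Claims 7, 8 and 10 each recite cepstral mean and variance normalization (CMVN) of the feature vectors before I-vector extraction. The sketch below shows per-utterance CMVN in plain Python, assuming the features arrive as a list of equal-length frames. The function name `cmvn` and the `eps` floor are illustrative choices, not details from the patent.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-10):
    """Scale each coefficient to zero mean and unit variance over one utterance.

    frames: list of equal-length lists of floats (one list per frame).
    eps:    assumed floor on the standard deviation to avoid division by zero.
    """
    if not frames:
        return []
    # Transpose so each column holds one coefficient across all frames.
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    stds = [max(pstdev(col), eps) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(frame, means, stds)]
        for frame in frames
    ]
```

A coefficient that is constant across the utterance has zero variance, so the floor maps it to zero rather than raising an error.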

14. A method for providing real-time transcripts of spoken text, the method comprising:
- converting, by a speech to text engine, an input speech of an end user into a text input, the text input comprises one or more sequences of recognized word strings and confusions in text form; and
  
  receiving, by a semantic engine, the text input for producing one or more transcripts using a language model and extracting semantic meanings for said one or more transcripts;
  
  wherein the semantic engine utilizes a grammar model and the language model to extract meaning for said one or more transcripts.
- - 15. The method of claim 14, further comprising identifying and storing additional metadata about the speaker, selected from the group consisting of age, gender, accent and emotional state of the end user.
  - 16. The method of claim 14, wherein the sequences of recognized word strings are assigned one or more action tags and entities.
  - 17. The method of claim 14, further comprising the step of extracting acoustic features from the input speech of the end user to identify the language and/or accent of the end user.
  - 18. The method of claim 14, further comprising the step of extracting acoustic features from the input speech of the end user to identify and verify the end user.
  - 19. The method of claim 14, further comprising the step of extracting acoustic and pitch features from the input speech to identify age and emotion of the end user.
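
Claim 14 describes a text input made of recognized word strings and confusions, from which a language model produces one or more transcripts. One way to picture that step is the minimal sketch below, which enumerates the paths through a small confusion network and ranks them by combined acoustic and bigram language-model log-probability. The network contents, the bigram table, the `UNSEEN_PROB` floor, and the function names are all made-up illustrations. Exhaustive enumeration is only practical for very small networks.

```python
import math
from itertools import product

# Hypothetical confusion network: one slot per word position,
# each slot listing (word, acoustic probability) alternatives.
CONFUSIONS = [
    [("i", 0.9), ("eye", 0.1)],
    [("want", 0.7), ("won't", 0.3)],
    [("to", 0.6), ("two", 0.4)],
    [("pay", 0.5), ("play", 0.5)],
]

# Hypothetical bigram probabilities; unseen pairs fall back to a small floor.
BIGRAM_PROB = {
    ("<s>", "i"): 0.5,
    ("i", "want"): 0.4,
    ("want", "to"): 0.6,
    ("to", "pay"): 0.3,
    ("to", "play"): 0.1,
}
UNSEEN_PROB = 1e-4


def lm_log_score(words):
    """Sum of bigram log-probabilities, starting from the sentence marker <s>."""
    history = ["<s>"] + list(words[:-1])
    return sum(
        math.log(BIGRAM_PROB.get((prev, word), UNSEEN_PROB))
        for prev, word in zip(history, words)
    )


def rank_transcripts(confusions, top_n=3):
    """Return the top_n candidate transcripts, best first."""
    scored = []
    for path in product(*confusions):
        words = [word for word, _ in path]
        acoustic = sum(math.log(prob) for _, prob in path)
        scored.append((acoustic + lm_log_score(words), " ".join(words)))
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_n]]
```

With these toy numbers, `rank_transcripts(CONFUSIONS)` places "i want to pay" first, because the language model prefers "to pay" over "to play" even though the acoustic scores for the last slot are tied.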

Current Assignee: Go-Vivace Inc.
Original Assignee: Go-Vivace Inc.
Inventors: Goel, Nagendra Kumar; Sarma, Mousmita
Granted Patent: US 10,347,244 B2

CPC Class Codes

G10L 15/005   Language recognition

G10L 15/02   Feature extraction for spee...

G10L 15/1815   Semantic context, e.g. disa...

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...

G10L 17/02   Preprocessing operations, e...

G10L 17/18   Artificial neural networks;...

G10L 2015/223   Execution procedure of a sp...

G10L 2015/225   Feedback of the input speech

G10L 2015/227   of the speaker; Human-fact...

G10L 25/24   the extracted parameters be...

G10L 25/63   for estimating an emotional...

G10L 25/78   Detection of presence or ab...

G10L 25/90   Pitch determination of spee...
