Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
First Claim
1. A system for providing real-time transcripts of spoken text, the system comprising:
a speech to text engine for converting an input speech of an end user into a text input, the text input comprising one or more sequences of recognized word strings or a word lattice in text form;
a semantic engine to receive the text input for producing one or more transcripts using a language model and extracting semantic meanings for said one or more transcripts;
wherein the semantic engine utilizes a grammar model and the language model to extract a meaning for said one or more transcripts;
an age and an emotion identification subsystem that detects the age and emotional state of the end user, said age and emotion identification subsystem comprises an end to end LSTM-RNN based DNN classifier;
said end to end classifier has two convolutional layers followed by two Network-in-Network (NIN) layers which perform the role of feature extraction from raw waveforms;
the end to end DNN classifier has 2 LSTM layers after the feature extraction layers followed by a softmax layer;
wherein the end to end DNN classifier has no separate acoustic feature extraction module at signal processing level and raw speech frames obtained from end user's input speech waveform are directly presented to the input layer of the DNN.
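The layer stack recited in the claim (two convolutional layers, two NIN layers, two LSTM layers, then softmax, fed raw speech frames with no separate acoustic front end) can be sketched at the shape level in NumPy. All layer sizes, weights, and the 7-way output here are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    # x: (T, C_in), w: (K, C_in, C_out); valid 1-D convolution -> (T-K+1, C_out)
    K = w.shape[0]
    return np.stack([np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0] - K + 1)])

def nin(x, w):
    # Network-in-Network layer == 1x1 convolution == per-frame dense projection
    return np.maximum(x @ w, 0.0)                     # ReLU nonlinearity (assumed)

def lstm(x, Wx, Wh, b):
    # Minimal single-layer LSTM; hidden size H = Wh.shape[0], gate order i,f,o,g
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    H = Wh.shape[0]
    h, c, hs = np.zeros(H), np.zeros(H), []
    for xt in x:
        i, f, o, g = np.split(xt @ Wx + h @ Wh + b, 4)
        i, f, o = sig(i), sig(f), sig(o)
        c = f * c + i * np.tanh(g)
        h = o * np.tanh(c)
        hs.append(h)
    return np.array(hs)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

# Raw waveform frames go straight into the network -- no MFCC/filterbank module.
T, F = 50, 40                                         # 50 raw frames of 40 samples (assumed)
frames = rng.standard_normal((T, F))

x = np.maximum(conv1d(frames, rng.standard_normal((5, F, 32)) * 0.1), 0)   # conv layer 1
x = np.maximum(conv1d(x, rng.standard_normal((5, 32, 32)) * 0.1), 0)       # conv layer 2
x = nin(x, rng.standard_normal((32, 32)) * 0.1)                            # NIN layer 1
x = nin(x, rng.standard_normal((32, 32)) * 0.1)                            # NIN layer 2
x = lstm(x, rng.standard_normal((32, 256)) * 0.1,                          # LSTM layer 1
         rng.standard_normal((64, 256)) * 0.1, np.zeros(256))
x = lstm(x, rng.standard_normal((64, 256)) * 0.1,                          # LSTM layer 2
         rng.standard_normal((64, 256)) * 0.1, np.zeros(256))
probs = softmax(x[-1] @ (rng.standard_normal((64, 7)) * 0.1))              # 7 classes (assumed)
print(probs.shape, probs.sum())
```

The sketch only demonstrates the data flow and shapes; a trained classifier would learn these weights jointly, which is what makes the stack "end to end."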
1 Assignment
0 Petitions
Abstract
A real-time dialogue system is provided that transcribes spoken text with sub-second delay by keeping track of word timings and word accuracy. The system uses a grammar or a list of keywords, together with a statistical language model, to produce the transcripts. In addition, the system uses a deep neural network based i-vector system to constantly analyze the audio quality and to identify additional metadata such as the gender, language, accent, age, emotion and identity of an end user to enhance the response. The invention provides a conversational dialogue system that robustly identifies certain specific user commands or intents while otherwise allowing for a natural conversation, without switching between grammar based and natural language modes.
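The abstract's "commands via grammar, everything else as natural conversation, without mode switching" behavior can be illustrated with a single routing pass: try the command grammar first, and fall through to free dialogue on no match. The intent names and patterns below are hypothetical examples, not taken from the patent.

```python
import re

# Hypothetical command grammar: intent name -> pattern over the transcript text.
COMMAND_GRAMMAR = {
    "transfer_to_agent": re.compile(r"\b(speak|talk)\s+to\s+(an?\s+)?(agent|human)\b"),
    "repeat":            re.compile(r"\brepeat\s+that\b"),
    "stop":              re.compile(r"\b(stop|cancel)\b"),
}

def route(transcript: str):
    """Match specific command intents first; everything else stays in free dialogue."""
    text = transcript.lower()
    for intent, pattern in COMMAND_GRAMMAR.items():
        if pattern.search(text):
            return ("command", intent)
    return ("conversation", transcript)    # falls through -- no explicit mode switch

print(route("Could I speak to an agent please"))   # ('command', 'transfer_to_agent')
print(route("Tell me about my last order"))        # ('conversation', ...)
```

Because every utterance takes the same path, the caller never has to decide in advance whether the user is issuing a command or chatting, which is the behavior the abstract claims.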
23 Citations
16 Claims
1. A system for providing real-time transcripts of spoken text, the system comprising:
a speech to text engine for converting an input speech of an end user into a text input, the text input comprising one or more sequences of recognized word strings or a word lattice in text form;
a semantic engine to receive the text input for producing one or more transcripts using a language model and extracting semantic meanings for said one or more transcripts;
wherein the semantic engine utilizes a grammar model and the language model to extract a meaning for said one or more transcripts;
an age and an emotion identification subsystem that detects the age and emotional state of the end user, said age and emotion identification subsystem comprises an end to end LSTM-RNN based DNN classifier;
said end to end classifier has two convolutional layers followed by two Network-in-Network (NIN) layers which perform the role of feature extraction from raw waveforms;
the end to end DNN classifier has 2 LSTM layers after the feature extraction layers followed by a softmax layer;
wherein the end to end DNN classifier has no separate acoustic feature extraction module at signal processing level and raw speech frames obtained from end user's input speech waveform are directly presented to the input layer of the DNN.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A method for providing real-time transcripts of spoken text, the method comprising:
converting, by a speech to text engine, an input speech of an end user into a text input, the text input comprising one or more sequences of recognized word strings and confusions in text form; and
receiving, by a semantic engine, the text input for producing one or more transcripts using a language model and extracting semantic meanings for said one or more transcripts;
wherein the semantic engine utilizes a grammar model and the language model to extract meaning for said one or more transcripts;
detecting the age and emotional state of the end user by an age and emotion identification subsystem, comprising an end to end LSTM-RNN based DNN classifier;
said end to end classifier has two convolutional layers followed by two Network-in-Network (NIN) layers which perform the role of feature extraction from raw waveforms;
the end to end DNN classifier has 2 LSTM layers after the feature extraction layers followed by a softmax layer; and
the end to end DNN classifier has no separate acoustic feature extraction module at signal processing level and raw speech frames obtained from end user's input speech waveform are directly presented to the input layer of the DNN.
View Dependent Claims (12, 13, 14, 15, 16)
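The "raw speech frames obtained from the input speech waveform" recited in both independent claims are simply overlapping slices of the sampled waveform, with no filterbank or MFCC computation in between. A minimal framing sketch, assuming a 16 kHz sample rate with 25 ms frames and a 10 ms hop (common choices, not values from the patent):

```python
import numpy as np

def frame_waveform(wave, frame_len, hop):
    """Slice a raw waveform into overlapping frames -- no MFCC/filterbank step."""
    n = 1 + (len(wave) - frame_len) // hop
    return np.stack([wave[i * hop : i * hop + frame_len] for i in range(n)])

# 1 s of 16 kHz audio: 25 ms frames (400 samples) with a 10 ms hop (160 samples)
wave = np.random.default_rng(1).standard_normal(16000)
frames = frame_waveform(wave, frame_len=400, hop=160)
print(frames.shape)   # (98, 400)
```

Each row of `frames` is what the claims describe as being "directly presented to the input layer of the DNN."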
Specification