Dialogue act estimation method, dialogue act estimation apparatus, and storage medium
First Claim
1. A dialogue act estimation method, in a dialogue act estimation system, comprising:
acquiring sounds by a microphone in a terminal;
determining, by a processor in the terminal, whether the acquired sounds are uttered sentences of one or more speakers or noise;
outputting the uttered sentences to a communication transmitter only when the processor determines that the acquired sounds are uttered sentences of the one or more speakers and are not noise;
converting the uttered sentences of the one or more speakers to one or more formatted communication signals when the processor determines that the acquired sounds are uttered sentences of the one or more speakers;
transmitting the one or more formatted communication signals from the terminal over a communication network to a server;
receiving the one or more formatted communication signals by the server;
converting the received one or more formatted communication signals by a processor in the server to the uttered sentences of the one or more speakers;
acquiring first training data by the server from the converted uttered sentences of the one or more speakers indicating, in a mutually associated manner, text data of a first sentence that can be a current uttered sentence, text data of a second sentence that can be an uttered sentence immediately previous to the first sentence, first speaker change information indicating whether a speaker of the first sentence is the same as a speaker of the second sentence, and dialogue act information indicating a class of the first sentence;
learning an association between the current uttered sentence and the dialogue act information by applying the first training data to a model;
storing a result of the learning as learning result information in a memory in the server;
acquiring dialogue data including text data of a third sentence of a current uttered sentence uttered by a user, text data of a fourth sentence of an uttered sentence immediately previous to the third sentence, and second speaker change information indicating whether the speaker of the third sentence is the same as a speaker of the fourth sentence;
estimating a dialogue act to which the third sentence is classified by applying the dialogue data to the model based on the learning result information; and
generating a correct response to the uttered sentences of the one or more speakers,
wherein the model includes
a first model that outputs a first feature vector based on the text data of the first sentence, the text data of the second sentence, the first speaker identification information, the second speaker identification information, and a first weight parameter, and
a second model that outputs a second feature vector based on the text data of the first sentence, the text data of the second sentence, the first speaker change information, and a second weight parameter,
wherein the first model determines the first feature vector from the first sentence and the second sentence according to a first RNN-LSTM (Recurrent Neural Network-Long Short Term Memory) having the first weight parameter dependent on the first speaker identification information and the second speaker identification information, and
wherein the second model determines the second feature vector from the first sentence and the second sentence according to a second RNN-LSTM having the second weight parameter dependent on the first speaker change information.
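The two-model structure recited in the claim can be illustrated with a minimal sketch. The claim specifies two RNN-LSTM encoders; here each encoder is replaced by a tiny deterministic stand-in so the sketch runs without any machine-learning framework. All function names, weight values, and class labels below are hypothetical, not taken from the patent.

```python
# Illustrative sketch of the claimed two-model structure: a first model whose
# weight parameter depends on speaker identification information, and a second
# model whose weight parameter depends on speaker change information. The two
# feature vectors are combined to select a dialogue act class.

def toy_encoder(tokens, weights):
    """Stand-in for an RNN-LSTM: folds a token sequence into a
    fixed-size feature vector under the given weight parameter."""
    vec = [0.0] * len(weights)
    for tok in tokens:
        code = sum(ord(c) for c in tok)  # deterministic token feature
        for i, w in enumerate(weights):
            vec[i] += w * ((code >> i) % 7)
    return vec

def first_feature(first_sent, second_sent, speaker1_id, speaker2_id):
    # First model: the weight parameter depends on the two speaker IDs.
    weights = [0.10 + 0.01 * speaker1_id, 0.20 + 0.01 * speaker2_id]
    return toy_encoder(first_sent.split() + second_sent.split(), weights)

def second_feature(first_sent, second_sent, speaker_changed):
    # Second model: the weight parameter depends on the speaker change flag.
    weights = [0.30, 0.15] if speaker_changed else [0.05, 0.25]
    return toy_encoder(first_sent.split() + second_sent.split(), weights)

DIALOGUE_ACTS = ["question", "answer", "backchannel", "other"]

def estimate_act(first_sent, second_sent, spk1_id, spk2_id, changed):
    # Concatenate the two feature vectors and map them to a class label.
    combined = (first_feature(first_sent, second_sent, spk1_id, spk2_id)
                + second_feature(first_sent, second_sent, changed))
    return DIALOGUE_ACTS[int(sum(combined)) % len(DIALOGUE_ACTS)]
```

In a real implementation the two encoders would be trained jointly and the class would be chosen by a softmax over the concatenated features; the point of the sketch is only the data flow the claim describes.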
Abstract
A dialogue act estimation method, in a dialogue act estimation apparatus, includes acquiring first training data indicating, in a mutually associated manner, text data of a first sentence that can be a current uttered sentence, text data of a second sentence that can be an uttered sentence immediately previous to the first sentence, speaker change information indicating whether a speaker of the first sentence is the same as a speaker of the second sentence, and dialogue act information indicating a class of the first sentence. The method further includes learning an association between the current uttered sentence and the dialogue act information by applying the first training data to a model, and storing a result of the learning as learning result information in a memory.
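The "mutually associated" training record described in the abstract can be sketched as a single data structure. The field names below are illustrative, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingExample:
    """One mutually associated training record: the current sentence, the
    immediately previous sentence, whether the speaker changed between
    them, and the dialogue act class of the current sentence."""
    first_sentence: str    # text data of the current uttered sentence
    second_sentence: str   # text data of the immediately previous sentence
    speaker_changed: bool  # True if the two sentences have different speakers
    dialogue_act: str      # class label of the first sentence

# Hypothetical training data illustrating the association.
examples = [
    TrainingExample("What time is it?", "Hello.", True, "question"),
    TrainingExample("It is noon.", "What time is it?", True, "answer"),
]
```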
6 Claims
1. A dialogue act estimation method, in a dialogue act estimation system, comprising:
acquiring sounds by a microphone in a terminal;
determining, by a processor in the terminal, whether the acquired sounds are uttered sentences of one or more speakers or noise;
outputting the uttered sentences to a communication transmitter only when the processor determines that the acquired sounds are uttered sentences of the one or more speakers and are not noise;
converting the uttered sentences of the one or more speakers to one or more formatted communication signals when the processor determines that the acquired sounds are uttered sentences of the one or more speakers;
transmitting the one or more formatted communication signals from the terminal over a communication network to a server;
receiving the one or more formatted communication signals by the server;
converting the received one or more formatted communication signals by a processor in the server to the uttered sentences of the one or more speakers;
acquiring first training data by the server from the converted uttered sentences of the one or more speakers indicating, in a mutually associated manner, text data of a first sentence that can be a current uttered sentence, text data of a second sentence that can be an uttered sentence immediately previous to the first sentence, first speaker change information indicating whether a speaker of the first sentence is the same as a speaker of the second sentence, and dialogue act information indicating a class of the first sentence;
learning an association between the current uttered sentence and the dialogue act information by applying the first training data to a model;
storing a result of the learning as learning result information in a memory in the server;
acquiring dialogue data including text data of a third sentence of a current uttered sentence uttered by a user, text data of a fourth sentence of an uttered sentence immediately previous to the third sentence, and second speaker change information indicating whether the speaker of the third sentence is the same as a speaker of the fourth sentence;
estimating a dialogue act to which the third sentence is classified by applying the dialogue data to the model based on the learning result information; and
generating a correct response to the uttered sentences of the one or more speakers,
wherein the model includes a first model that outputs a first feature vector based on the text data of the first sentence, the text data of the second sentence, the first speaker identification information, the second speaker identification information, and a first weight parameter, and a second model that outputs a second feature vector based on the text data of the first sentence, the text data of the second sentence, the first speaker change information, and a second weight parameter,
wherein the first model determines the first feature vector from the first sentence and the second sentence according to a first RNN-LSTM (Recurrent Neural Network-Long Short Term Memory) having the first weight parameter dependent on the first speaker identification information and the second speaker identification information, and
wherein the second model determines the second feature vector from the first sentence and the second sentence according to a second RNN-LSTM having the second weight parameter dependent on the first speaker change information.
- View Dependent Claims (2, 3, 4)
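The terminal-side gating step recited above (outputting sentences only when the sounds are determined to be speech rather than noise) can be sketched as follows. The claim does not fix how the speech/noise determination is made; the energy-threshold test here is purely illustrative, and all names and the threshold value are assumptions.

```python
def is_uttered_speech(samples, threshold=0.02):
    """Crude energy-based gate: treat the frame as speech only if its
    mean absolute amplitude exceeds a threshold. A real system would use
    a proper voice-activity detector; this test is illustrative only."""
    if not samples:
        return False
    energy = sum(abs(s) for s in samples) / len(samples)
    return energy > threshold

def maybe_transmit(samples, transmit):
    # Forward to the communication transmitter only for speech, never for noise.
    if is_uttered_speech(samples):
        transmit(samples)
        return True
    return False

sent = []
maybe_transmit([0.2, -0.3, 0.25], sent.append)     # speech-like frame
maybe_transmit([0.001, -0.002, 0.0], sent.append)  # noise-like frame
# only the first frame reaches the transmitter
```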
5. A dialogue act estimation system, comprising:
a microphone in a terminal that acquires sounds;
a processor in the terminal that determines whether the acquired sounds are uttered sentences of one or more speakers or noise, outputs the uttered sentences only when the processor determines that the acquired sounds are uttered sentences of the one or more speakers and are not noise, converts the uttered sentences of the one or more speakers to one or more formatted communication signals when the processor determines that the acquired sounds are uttered sentences of the one or more speakers, and transmits the one or more formatted communication signals from the terminal over a communication network; and
a server that receives the one or more formatted communication signals;
converts the received one or more formatted communication signals to the uttered sentences of the one or more speakers;
acquires first training data from the converted uttered sentences of the one or more speakers indicating, in a mutually associated manner, text data of a first sentence that can be a current uttered sentence, text data of a second sentence that can be an uttered sentence immediately previous to the first sentence, first speaker change information indicating whether a speaker of the first sentence is the same as a speaker of the second sentence, and dialogue act information indicating a class of the first sentence;
learns an association between the current uttered sentence and the dialogue act information by applying the first training data to a model;
stores a result of the learning as learning result information in a memory;
acquires dialogue data including text data of a third sentence of a current uttered sentence uttered by a user, text data of a fourth sentence of an uttered sentence immediately previous to the third sentence, and second speaker change information indicating whether the speaker of the third sentence is the same as a speaker of the fourth sentence;
estimates a dialogue act to which the third sentence is classified by applying the dialogue data to the model based on the learning result information; and
generates a correct response to the uttered sentences of the one or more speakers,
wherein the model includes a first model that outputs a first feature vector based on the text data of the first sentence, the text data of the second sentence, the first speaker identification information, the second speaker identification information, and a first weight parameter, and a second model that outputs a second feature vector based on the text data of the first sentence, the text data of the second sentence, the first speaker change information, and a second weight parameter,
wherein the first model determines the first feature vector from the first sentence and the second sentence according to a first RNN-LSTM (Recurrent Neural Network-Long Short Term Memory) having the first weight parameter dependent on the first speaker identification information and the second speaker identification information, and
wherein the second model determines the second feature vector from the first sentence and the second sentence according to a second RNN-LSTM having the second weight parameter dependent on the first speaker change information.
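The server-side estimation step (classifying the third sentence by applying the dialogue data to the model based on the stored learning result information) can be sketched with a toy lookup in place of the learned model. The table contents, function names, and fallback label below are hypothetical.

```python
# Hypothetical learning-result table: maps a (speaker-changed, cue-word)
# pair to the most frequent dialogue act seen for it in training. A real
# system would store learned RNN-LSTM weights instead.
learning_result = {
    (True, "what"): "question",
    (True, "yes"): "answer",
    (False, "so"): "statement",
}

def estimate_dialogue_act(third_sentence, fourth_sentence, speaker_changed):
    """Classify the current (third) sentence using the stored learning
    result and the second speaker change information. Falls back to
    'other' when no cue word matches."""
    for word in third_sentence.lower().split():
        act = learning_result.get((speaker_changed, word.strip("?.,!")))
        if act:
            return act
    return "other"

print(estimate_dialogue_act("What is your name?", "Hello there.", True))
# → question
```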
6. A plurality of non-transitory storage mediums storing computer-readable programs, the programs causing a plurality of computers to execute a process including:
acquiring sounds by a microphone in a terminal;
determining, by a processor in the terminal, whether the acquired sounds are uttered sentences of one or more speakers or noise;
outputting the uttered sentences to a communication transmitter only when the processor determines that the acquired sounds are uttered sentences of the one or more speakers and are not noise;
converting the uttered sentences of the one or more speakers to one or more formatted communication signals when the processor determines that the acquired sounds are uttered sentences of the one or more speakers;
transmitting the one or more formatted communication signals from the terminal over a communication network to a server;
receiving the one or more formatted communication signals by the server;
converting the received one or more formatted communication signals by the server to the uttered sentences of the one or more speakers;
acquiring first training data by the server from the converted uttered sentences of the one or more speakers indicating, in a mutually associated manner, text data of a first sentence that can be a current uttered sentence, text data of a second sentence that can be an uttered sentence immediately previous to the first sentence, first speaker change information indicating whether a speaker of the first sentence is the same as a speaker of the second sentence, and dialogue act information indicating a class of the first sentence;
learning an association between the current uttered sentence and the dialogue act information by applying the first training data to a model;
storing a result of the learning as learning result information in a memory in the server;
acquiring dialogue data including text data of a third sentence of a current uttered sentence uttered by a user, text data of a fourth sentence of an uttered sentence immediately previous to the third sentence, and second speaker change information indicating whether the speaker of the third sentence is the same as a speaker of the fourth sentence;
estimating a dialogue act to which the third sentence is classified by applying the dialogue data to the model based on the learning result information; and
generating a correct response to the uttered sentences of the one or more speakers,
wherein the model includes a first model that outputs a first feature vector based on the text data of the first sentence, the text data of the second sentence, the first speaker identification information, the second speaker identification information, and a first weight parameter, and a second model that outputs a second feature vector based on the text data of the first sentence, the text data of the second sentence, the first speaker change information, and a second weight parameter,
wherein the first model determines the first feature vector from the first sentence and the second sentence according to a first RNN-LSTM (Recurrent Neural Network-Long Short Term Memory) having the first weight parameter dependent on the first speaker identification information and the second speaker identification information, and
wherein the second model determines the second feature vector from the first sentence and the second sentence according to a second RNN-LSTM having the second weight parameter dependent on the first speaker change information.
Specification