System for automatic extraction of structure from spoken conversation using lexical and acoustic features
First Claim
1. A computer-implemented method for extracting conversational structure from a spoken conversation, comprising:
obtaining, by a computer system comprising a set of processors, a voice record of the spoken conversation;
dividing the voice record into at least three sequential utterances spoken by two different speakers;
extracting a lexical feature from a respective utterance of the three sequential utterances in the voice record using an automatic speech recognition;
extracting an acoustic feature from a respective utterance of the three sequential utterances in the voice record; and
determining, via a machine learning method and based on the extracted lexical and acoustic features, a coarse-level conversational structure of the spoken conversation, which involves:
determining a plurality of fine-level conversational activities based on the three sequential utterances, wherein a respective fine-level conversational activity is an activity identified based on respective words spoken by the two different speakers;
determining, based on the plurality of fine-level conversational activities, a plurality of coarse-level conversational activities, wherein a respective coarse-level conversational activity of the plurality of coarse-level conversational activities indicates a phase of the spoken conversation that includes multiple fine-level conversational activities of the plurality of fine-level conversational activities; and
generating the coarse-level conversational structure from the plurality of coarse-level conversational activities, wherein the coarse-level conversational structure indicates a high-level structure of the spoken conversation spanning the three sequential utterances.
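The fine-to-coarse step recited above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Utterance` fields, the activity labels, the `FINE_TO_COARSE` mapping, and the 0.5 pitch threshold do not come from the claim. The keyword and pitch rules in `classify_fine` are only a toy stand-in for the machine learning method, which the claim does not specify.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Utterance:
    speaker: str        # hypothetical speaker tag, e.g. "agent" or "customer"
    text: str           # stand-in for ASR output (source of lexical features)
    pitch_slope: float  # stand-in for an acoustic feature (rising intonation > 0)


# Hypothetical label sets; the claim does not enumerate any activities.
FINE_TO_COARSE = {
    "greeting": "opening",
    "identify_caller": "opening",
    "describe_problem": "problem_statement",
    "ask_question": "problem_statement",
    "propose_fix": "resolution",
    "thanks": "closing",
}


def classify_fine(utt: Utterance) -> str:
    """Toy rules standing in for a trained classifier (assumption)."""
    words = utt.text.lower()
    if "hello" in words:
        return "greeting"
    if "my name is" in words:
        return "identify_caller"
    if utt.pitch_slope > 0.5:  # acoustic cue: rising pitch suggests a question
        return "ask_question"
    if "try" in words or "restart" in words:
        return "propose_fix"
    if "thanks" in words or "bye" in words:
        return "thanks"
    return "describe_problem"


def coarse_structure(utterances):
    """Return [(coarse_label, [fine_labels, ...]), ...] in conversation order."""
    fine = [classify_fine(u) for u in utterances]
    # Merge consecutive fine-level activities that map to the same phase.
    return [
        (coarse, list(run))
        for coarse, run in groupby(fine, key=FINE_TO_COARSE.__getitem__)
    ]


# Made-up example conversation between two speakers.
conversation = [
    Utterance("agent", "hello thank you for calling support", 0.0),
    Utterance("customer", "my name is dana", 0.0),
    Utterance("customer", "my router keeps dropping the connection", -0.1),
    Utterance("customer", "is it broken", 0.9),
    Utterance("agent", "please restart the router", 0.0),
    Utterance("customer", "that worked thanks", 0.0),
]

for phase, activities in coarse_structure(conversation):
    print(phase, activities)
```

Each coarse-level phase in the result groups one or more consecutive fine-level activities, which mirrors the claim's requirement that a phase may include multiple fine-level activities.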
9 Assignments
0 Petitions
Abstract
Embodiments of the present invention provide a system for automatically extracting conversational structure from a voice record based on lexical and acoustic features. The system also aggregates business-relevant statistics and entities from a collection of spoken conversations. The system may infer a coarse-level conversational structure based on fine-level activities identified from extracted acoustic features. The system improves significantly over previous systems by extracting structure based on lexical and acoustic features. This enables extracting conversational structure on a larger scale and finer level of detail than previous systems, and can feed an analytics and business intelligence platform, e.g. for customer service phone calls. During operation, the system obtains a voice record. The system then extracts a lexical feature using automatic speech recognition (ASR). The system extracts an acoustic feature. The system then determines, via machine learning and based on the extracted lexical and acoustic features, a coarse-level structure of the conversation.
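As a rough illustration of the aggregation the abstract mentions, the sketch below computes per-phase statistics over a small collection of calls. The phase names, the durations, and the `phase_statistics` helper are all hypothetical; the abstract does not say which statistics the system computes or how.

```python
from collections import defaultdict
from statistics import mean

# Each call is a list of (coarse_phase, duration_seconds) pairs.
# Made-up illustrative values, not data from the patent.
calls = [
    [("opening", 12.0), ("problem_statement", 45.0), ("resolution", 90.0), ("closing", 8.0)],
    [("opening", 10.0), ("problem_statement", 75.0), ("resolution", 30.0), ("closing", 6.0)],
    [("opening", 14.0), ("problem_statement", 60.0), ("closing", 10.0)],
]


def phase_statistics(calls):
    """Mean time spent per phase, and the share of calls containing that phase."""
    durations = defaultdict(list)
    for call in calls:
        totals = defaultdict(float)
        for phase, seconds in call:
            totals[phase] += seconds  # a phase may recur within one call
        for phase, total in totals.items():
            durations[phase].append(total)
    n_calls = len(calls)
    return {
        phase: {
            "mean_seconds": mean(values),
            "share_of_calls": len(values) / n_calls,
        }
        for phase, values in durations.items()
    }


stats = phase_statistics(calls)
for phase, summary in stats.items():
    print(phase, summary)
```

A summary of this kind (for example, how often calls reach a resolution phase) is one plausible input to the analytics platform the abstract describes.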
25 Citations
20 Claims
1. A computer-implemented method for extracting conversational structure from a spoken conversation, comprising:
obtaining, by a computer system comprising a set of processors, a voice record of the spoken conversation;
dividing the voice record into at least three sequential utterances spoken by two different speakers;
extracting a lexical feature from a respective utterance of the three sequential utterances in the voice record using an automatic speech recognition;
extracting an acoustic feature from a respective utterance of the three sequential utterances in the voice record; and
determining, via a machine learning method and based on the extracted lexical and acoustic features, a coarse-level conversational structure of the spoken conversation, which involves:
determining a plurality of fine-level conversational activities based on the three sequential utterances, wherein a respective fine-level conversational activity is an activity identified based on respective words spoken by the two different speakers;
determining, based on the plurality of fine-level conversational activities, a plurality of coarse-level conversational activities, wherein a respective coarse-level conversational activity of the plurality of coarse-level conversational activities indicates a phase of the spoken conversation that includes multiple fine-level conversational activities of the plurality of fine-level conversational activities; and
generating the coarse-level conversational structure from the plurality of coarse-level conversational activities, wherein the coarse-level conversational structure indicates a high-level structure of the spoken conversation spanning the three sequential utterances.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
10. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for extracting conversational structure from a spoken conversation, the method comprising:
obtaining a voice record of the spoken conversation;
dividing the voice record into at least three sequential utterances spoken by two different speakers;
extracting a lexical feature from a respective utterance of the at least three sequential utterances in the voice record using speech recognition;
extracting an acoustic feature from a respective utterance of the at least three sequential utterances in the voice record; and
determining, via a machine learning method and based on the extracted lexical and acoustic features, a coarse-level conversational structure of the spoken conversation, which involves:
determining a plurality of fine-level conversational activities based on the three sequential utterances, wherein a respective fine-level conversational activity is an activity identified based on respective words spoken by the two different speakers;
determining, based on the plurality of fine-level conversational activities, a plurality of coarse-level conversational activities, wherein a respective coarse-level conversational activity of the plurality of coarse-level conversational activities indicates a phase of the spoken conversation that includes multiple fine-level conversational activities of the plurality of fine-level conversational activities; and
generating the coarse-level conversational structure from the plurality of coarse-level conversational activities, wherein the coarse-level conversational structure indicates a high-level structure of the spoken conversation spanning the at least three sequential utterances.
Dependent claims: 11, 12, 13, 14, 15
16. A computing system for extracting conversational structure from a spoken conversation, the system comprising:
a set of processors; and
a non-transitory computer-readable medium coupled to the set of processors storing instructions thereon that, when executed by the processors, cause the processors to perform a method for extracting conversational structure from a spoken conversation, the method comprising:
obtaining a voice record of the spoken conversation;
dividing the voice record into at least three sequential utterances spoken by two different speakers;
extracting a lexical feature from a respective utterance of the three sequential utterances in the voice record using an automatic speech recognition;
extracting an acoustic feature from a respective utterance of the three sequential utterances in the voice record; and
determining, via a machine learning method and based on the extracted lexical feature and acoustic feature, a coarse-level conversational structure of the spoken conversation, which involves:
determining a plurality of fine-level conversational activities based on the three sequential utterances, wherein a respective fine-level conversational activity is an activity identified based on respective words spoken by the two different speakers;
determining, based on the plurality of fine-level conversational activities, a plurality of coarse-level conversational activities, wherein a respective coarse-level conversational activity of the plurality of coarse-level conversational activities indicates a phase of the spoken conversation that includes multiple fine-level conversational activities of the plurality of fine-level conversational activities; and
generating the coarse-level conversational structure from the plurality of coarse-level conversational activities, wherein the coarse-level conversational structure indicates a high-level structure of the spoken conversation spanning the three sequential utterances.
Dependent claims: 17, 18, 19, 20
Specification