System for automatic extraction of structure from spoken conversation using lexical and acoustic features

US 10,592,611 B2
Filed: 10/24/2016
Issued: 03/17/2020
Est. Priority Date: 10/24/2016
Status: Active Grant

Abstract

Embodiments of the present invention provide a system for automatically extracting conversational structure from a voice record based on lexical and acoustic features. The system also aggregates business-relevant statistics and entities from a collection of spoken conversations. The system may infer a coarse-level conversational structure based on fine-level activities identified from extracted acoustic features. The system improves significantly over previous systems by extracting structure based on both lexical and acoustic features. This enables extracting conversational structure at a larger scale and a finer level of detail than previous systems, and the extracted structure can feed an analytics and business intelligence platform, e.g., for customer service phone calls. During operation, the system obtains a voice record, extracts a lexical feature using automatic speech recognition (ASR), and extracts an acoustic feature. The system then determines, via machine learning and based on the extracted lexical and acoustic features, a coarse-level structure of the conversation.
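
The feature-extraction steps named in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: `Utterance`, `lexical_features`, `acoustic_features`, and `utterance_features` are hypothetical names, the `text` field is assumed to hold the output of an ASR system, and only timing-based acoustic features are computed because pitch, intensity, or MFCCs would require the audio signal itself.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str   # e.g. "agent" or "customer" (hypothetical labels)
    start: float   # seconds from the start of the voice record
    end: float
    text: str      # transcript text, assumed to come from an ASR system


def lexical_features(utt: Utterance) -> dict:
    """Bag-of-words counts over the transcript of one utterance."""
    tokens = utt.text.lower().split()
    return {f"word={w}": float(c) for w, c in Counter(tokens).items()}


def acoustic_features(utt: Utterance) -> dict:
    """Timing-based features only: utterance length and speaking rate."""
    duration = utt.end - utt.start
    n_words = len(utt.text.split())
    return {
        "duration_s": duration,
        "speaking_rate_wps": n_words / duration if duration > 0 else 0.0,
    }


def utterance_features(utt: Utterance) -> dict:
    """Merge both feature families into one mapping for a downstream model."""
    return {**lexical_features(utt), **acoustic_features(utt)}


utt = Utterance("agent", 0.0, 2.0, "thank you for calling how can I help")
features = utterance_features(utt)
```

A mapping like `features` could be passed to any classifier that accepts sparse named features; which model the system actually uses is not stated in the abstract.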

25 Citations

20 Claims

1. A computer-implemented method for extracting conversational structure from a spoken conversation, comprising:
- obtaining, by a computer system comprising a set of processors, a voice record of the spoken conversation;
  
  dividing the voice record into at least three sequential utterances spoken by two different speakers;
  
  extracting a lexical feature from a respective utterance of the three sequential utterances in the voice record using an automatic speech recognition;
  
  extracting an acoustic feature from a respective utterance of the three sequential utterances in the voice record; and
  
  determining, via a machine learning method and based on the extracted lexical and acoustic features, a coarse-level conversational structure of the spoken conversation, which involves:
  
  determining a plurality of fine-level conversational activities based on the three sequential utterances, wherein a respective fine-level conversational activity is an activity identified based on respective words spoken by the two different speakers;
  
  determining, based on the plurality of fine-level conversational activities, a plurality of coarse-level conversational activities, wherein a respective coarse-level conversational activity of the plurality of coarse-level conversational activities indicates a phase of the spoken conversation that includes multiple fine-level conversational activities of the plurality of fine-level conversational activities; and
  
  generating the coarse-level conversational structure from the plurality of coarse-level conversational activities, wherein the coarse-level conversational structure indicates a high-level structure of the spoken conversation spanning the three sequential utterances.
- - 2. The method of claim 1:
    - wherein extracting the lexical feature further comprises generating a textual transcript of the voice record of the spoken conversation;
      
      wherein extracting the acoustic feature further comprises identifying, based on the extracted acoustic feature and the textual transcript, three fine-level conversational activities associated respectively with the three sequential utterances; and
      
      wherein determining the coarse-level conversational structure further comprises:
      
      identifying that a first and a second fine-level conversational activities of the three fine-level conversational activities correspond to a first coarse-level conversational activity; and
      
      identifying that a third fine-level activity of the three fine-level conversational activities transitions to a second coarse-level activity, which is different from the first coarse-level conversational activity.
  - 3. The method of claim 1, wherein the spoken conversation is a customer service conversation, and wherein the plurality of coarse-level conversational activities include one or more of:
    - opening;
      
      detail gathering;
      
      equipment identification;
      
      security questions;
      
      problem articulation;
      
      diagnostics;
      
      fix deployment;
      
      customer satisfaction questions;
      
      hold;
      
      transfer;
      
      pre-closing; and
      
      closing.
  - 4. The method of claim 1, wherein the extracted acoustic feature includes one or more of:
    - speaking pitch;
      
      speaking intensity;
      
      timing or length of an utterance;
      
      timing of silence or pauses;
      
      overlap of utterances;
      
      repetition of phrases, words, or word fragments;
      
      speaking rhythm;
      
      speaking rate;
      
      speaking intonation;
      
      laughter;
      
      a Mel-frequency cepstral coefficient (MFCC); and
      
      a derived acoustic feature.
  - 5. The method of claim 1, further comprising determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, a fine-level activity structure of the spoken conversation, wherein the fine-level activity structure indicates a structure of one of the three sequential utterances.
  - 6. The method of claim 5, further comprising determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, one or more intermediate-level structures of the spoken conversation.
  - 7. The method of claim 5, wherein the fine-level activity structure indicates a fine-level activity including one or more of:
    - an information request;
      
      a clarification request;
      
      a repetition request;
      
      an action request;
      
      pointing;
      
      a future action request;
      
      an alignment request;
      
      a continuer;
      
      a confirmation;
      
      a sequence closer;
      
      a correction;
      
      information provision;
      
      reporting activity status;
      
      waiting;
      
      reporting a future event; and
      
      reciprocity.
  - 8. The method of claim 1:
    - wherein the machine learning method comprises a sequence model; and
      
      wherein determining the coarse-level conversational structure further comprises tracking, using the sequence model, a global conversational state within the coarse-level conversational structure.
  - 9. The method of claim 1, further comprising:
    - computing for a user, via a business intelligence platform, an aggregate statistic, comprising a distribution over activities, categories, and/or entities, from a plurality of conversations that includes the spoken conversation; and
      
      extracting for the user, via the business intelligence platform, targeted information about the spoken conversation.
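
The relationship between fine-level and coarse-level activities in claims 1 to 3, and the aggregate statistic of claim 9, can be sketched as follows. This is an illustration under stated assumptions: the `labeled` list is made-up data standing in for the output of the machine learning method, and `coarse_structure` and `activity_distribution` are hypothetical helper names. The activity labels are taken from the lists in claims 3 and 7.

```python
from collections import Counter
from itertools import groupby

# Hypothetical classifier output: one (fine-level activity,
# coarse-level activity) pair per utterance, in conversation order.
labeled = [
    ("information request", "opening"),
    ("information provision", "opening"),
    ("information request", "problem articulation"),
    ("information provision", "problem articulation"),
    ("confirmation", "problem articulation"),
    ("action request", "diagnostics"),
]


def coarse_structure(pairs):
    """Collapse consecutive utterances sharing a coarse label into phases."""
    phases = []
    for coarse, run in groupby(pairs, key=lambda p: p[1]):
        phases.append((coarse, [fine for fine, _ in run]))
    return phases


def activity_distribution(conversations):
    """Fraction of utterances per coarse-level activity across conversations."""
    counts = Counter(coarse for conv in conversations for _, coarse in conv)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


structure = coarse_structure(labeled)
distribution = activity_distribution([labeled])
```

Here the first two fine-level activities fall into one coarse-level phase and the third begins a different phase, which mirrors the transition described in claim 2.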

10. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for extracting conversational structure from a spoken conversation, the method comprising:
- obtaining a voice record of the spoken conversation;
  
  dividing the voice record into at least three sequential utterances spoken by two different speakers;
  
  extracting a lexical feature from a respective utterance of the at least three sequential utterances in the voice record using speech recognition;
  
  extracting an acoustic feature from a respective utterance of the at least three sequential utterances in the voice record; and
  
  determining, via a machine learning method and based on the extracted lexical and acoustic features, a coarse-level conversational structure of the spoken conversation, which involves:
  
  determining a plurality of fine-level conversational activities based on the three sequential utterances, wherein a respective fine-level conversational activity is an activity identified based on respective words spoken by the two different speakers;
  
  determining, based on the plurality of fine-level conversational activities, a plurality of coarse-level conversational activities, wherein a respective coarse-level conversational activity of the plurality of coarse-level conversational activities indicates a phase of the spoken conversation that includes multiple fine-level conversational activities of the plurality of fine-level conversational activities; and
  
  generating the coarse-level conversational structure from the plurality of coarse-level conversational activities, wherein the coarse-level conversational structure indicates a high-level structure of the spoken conversation spanning the at least three sequential utterances.
- - 11. The non-transitory computer-readable storage medium of claim 10:
    - wherein extracting the lexical feature further comprises generating a textual transcript of the voice record of the spoken conversation;
      
      wherein extracting the acoustic feature further comprises identifying, based on the extracted acoustic feature and the textual transcript, three fine-level conversational activities associated respectively with the three sequential utterances; and
      
      wherein determining the coarse-level conversational structure further comprises:
      
      identifying that a first and a second fine-level conversational activities of the three fine-level conversational activities correspond to a first coarse-level conversational activity; and
      
      identifying that a third fine-level activity of the three fine-level conversational activities transitions to a second coarse-level activity, which is different from the first coarse-level conversational activity.
  - 12. The non-transitory computer-readable storage medium of claim 10, wherein the extracted acoustic feature includes one or more of:
    - speaking pitch;
      
      speaking intensity;
      
      timing or length of an utterance;
      
      timing of silence or pauses;
      
      overlap of utterances;
      
      repetition of phrases, words, or word fragments;
      
      speaking rhythm;
      
      speaking rate;
      
      speaking intonation;
      
      laughter;
      
      a Mel-frequency cepstral coefficient (MFCC); and
      
      a derived acoustic feature.
  - 13. The non-transitory computer-readable storage medium of claim 10, wherein the method further comprises determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, a fine-level activity structure of the spoken conversation, wherein the fine-level activity structure indicates a structure of one of the three sequential utterances.
  - 14. The non-transitory computer-readable storage medium of claim 13, wherein the method further comprises determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, one or more intermediate-level structures of the spoken conversation.
  - 15. The non-transitory computer-readable storage medium of claim 10:
    - wherein the machine learning method comprises a sequence model; and
      
      wherein determining the coarse-level conversational structure further comprises tracking, using the sequence model, a global conversational state within the coarse-level conversational structure.
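
Claims 8 and 15 refer to a sequence model that tracks a global conversational state but do not name a specific model. One possible instance, shown here purely as an assumption, is a hidden Markov model decoded with the Viterbi algorithm. The state set is a subset of the coarse-level activities in claim 3, and every probability below is a made-up illustration value rather than a trained parameter.

```python
import math

STATES = ["opening", "diagnostics", "closing"]

# Made-up illustration values, not trained parameters.
START = {"opening": 0.8, "diagnostics": 0.1, "closing": 0.1}
TRANS = {
    "opening": {"opening": 0.6, "diagnostics": 0.3, "closing": 0.1},
    "diagnostics": {"opening": 0.05, "diagnostics": 0.7, "closing": 0.25},
    "closing": {"opening": 0.05, "diagnostics": 0.05, "closing": 0.9},
}


def viterbi(emissions):
    """Return the most likely coarse-level state for each utterance.

    emissions[t][s] is a per-utterance likelihood for state s, assumed to
    come from a classifier over lexical and acoustic features.
    """
    best = {s: math.log(START[s]) + math.log(emissions[0][s]) for s in STATES}
    back = []
    for obs in emissions[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            # Best predecessor state for reaching s at this utterance.
            p = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            best[s] = prev[p] + math.log(TRANS[p][s]) + math.log(obs[s])
            pointers[s] = p
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


emissions = [
    {"opening": 0.7, "diagnostics": 0.2, "closing": 0.1},
    {"opening": 0.4, "diagnostics": 0.3, "closing": 0.3},  # ambiguous utterance
    {"opening": 0.1, "diagnostics": 0.8, "closing": 0.1},
    {"opening": 0.1, "diagnostics": 0.2, "closing": 0.7},
]
path = viterbi(emissions)
```

With these illustrative numbers the decoded path is opening, opening, diagnostics, closing. The ambiguous second utterance stays in the opening phase because the transition probabilities favor remaining in the current state, which is the sense in which a sequence model carries global context from one utterance to the next.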

16. A computing system for extracting conversational structure from a spoken conversation, the system comprising:
- a set of processors; and
  
  a non-transitory computer-readable medium coupled to the set of processors storing instructions thereon that, when executed by the processors, cause the processors to perform a method for extracting conversational structure from a spoken conversation, the method comprising:
  
  obtaining a voice record of the spoken conversation;
  
  dividing the voice record into at least three sequential utterances spoken by two different speakers;
  
  extracting a lexical feature from a respective utterance of the three sequential utterances in the voice record using an automatic speech recognition;
  
  extracting an acoustic feature from a respective utterance of the three sequential utterances in the voice record; and
  
  determining, via a machine learning method and based on the extracted lexical feature and acoustic feature, a coarse-level conversational structure of the spoken conversation, which involves:
  
  determining a plurality of fine-level conversational activities based on the three sequential utterances, wherein a respective fine-level conversational activity is an activity identified based on respective words spoken by the two different speakers;
  
  determining, based on the plurality of fine-level conversational activities, a plurality of coarse-level conversational activities, wherein a respective coarse-level conversational activity of the plurality of coarse-level conversational activities indicates a phase of the spoken conversation that includes multiple fine-level conversational activities of the plurality of fine-level conversational activities; and
  
  generating the coarse-level conversational structure from the plurality of coarse-level conversational activities, wherein the coarse-level conversational structure indicates a high-level structure of the spoken conversation spanning the three sequential utterances.
- - 17. The computing system of claim 16:
    - wherein extracting the lexical feature from the voice record further comprises generating a textual transcript of the voice record of the spoken conversation;
      
      wherein extracting the acoustic feature from the voice record further comprises identifying, based on the extracted acoustic feature and the textual transcript, three fine-level conversational activities associated respectively with the three sequential utterances; and
      
      wherein determining the coarse-level conversational structure of the spoken conversation further comprises:
      
      identifying that a first and a second fine-level conversational activities of the three fine-level conversational activities correspond to a first coarse-level conversational activity; and
      
      identifying that a third fine-level activity of the three fine-level conversational activities transitions to a second coarse-level activity, which is different from the first coarse-level conversational activity.
  - 18. The computing system of claim 16, wherein the extracted acoustic feature includes one or more of:
    - speaking pitch;
      
      speaking intensity;
      
      timing or length of an utterance;
      
      timing of silence or pauses;
      
      overlap of utterances;
      
      repetition of phrases, words, or word fragments;
      
      speaking rhythm;
      
      speaking rate;
      
      speaking intonation;
      
      laughter;
      
      a Mel-frequency cepstral coefficient (MFCC); and
      
      a derived acoustic feature.
  - 19. The computing system of claim 16, wherein the method further comprises determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, a fine-level activity structure of the spoken conversation, wherein the fine-level activity structure indicates a structure of one of the three sequential utterances.
  - 20. The computing system of claim 16:
    - wherein the machine learning method comprises a sequence model; and
      
      wherein determining the coarse-level conversational structure further comprises tracking, using the sequence model, a global conversational state within the coarse-level conversational structure.
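
Two of the acoustic features listed in claims 4, 12, and 18, timing of silence or pauses and overlap of utterances, can be derived from utterance boundaries alone. The sketch below assumes the voice record has already been divided into speaker-labeled segments with start and end times; `Segment` and `turn_timing` are hypothetical names, and the timestamps are made-up example data.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    speaker: str
    start: float  # seconds from the start of the voice record
    end: float


def turn_timing(segments: List[Segment]) -> List[dict]:
    """For each utterance after the first, measure the pause before it or
    its overlap with the utterance that started just before it."""
    ordered = sorted(segments, key=lambda s: s.start)
    rows = []
    for prev, cur in zip(ordered, ordered[1:]):
        rows.append({
            "speaker": cur.speaker,
            # Silence between the previous utterance ending and this one starting.
            "pause_s": max(cur.start - prev.end, 0.0),
            # Time during which both utterances are active.
            "overlap_s": max(min(prev.end, cur.end) - cur.start, 0.0),
            "speaker_change": cur.speaker != prev.speaker,
        })
    return rows


segments = [
    Segment("agent", 0.0, 2.5),
    Segment("customer", 3.0, 6.0),
    Segment("agent", 5.5, 7.0),
]
timing = turn_timing(segments)
```

In this example the customer starts half a second after the agent finishes, and the agent's second utterance begins half a second before the customer finishes, so the two rows carry a pause and an overlap respectively.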

Current Assignee: Conduent Business Services, LLC (Conduent, Inc.)
Original Assignee: Conduent Business Services, LLC (Conduent, Inc.)
Inventors: Vig, Jesse; Arsikere, Harish; Szymanski, Margaret H.; Plurkowski, Luke R.; Dent, Kyle D.; Bobrow, Daniel G.; Davies, Daniel; Saund, Eric
Primary Examiner(s): Colucci, Michael C
Application Number: US15/332,766
Publication Number: US 20180113854A1
Time in Patent Office: 1,240 Days
Field of Search: 704/9, 704/278, 704/260, 704/254, 704/246, 704/245, 704/235, 707/776, 707/739, 379/8818, 379/2661, 379/26509
CPC Class Codes

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/35   Discourse or dialogue repre...

G10L 15/26   Speech to text systems G10L...

G10L 25/48   specially adapted for parti...

H04M 2201/40   using speech recognition sp...

H04M 2203/357   Autocues for dialog assistance

H04M 3/51   Centralised call answering ...
