SYSTEM FOR AUTOMATIC EXTRACTION OF STRUCTURE FROM SPOKEN CONVERSATION USING LEXICAL AND ACOUSTIC FEATURES

US 20180113854A1
Filed: 10/24/2016
Published: 04/26/2018
Est. Priority Date: 10/24/2016
Status: Active Grant

Abstract

Embodiments of the present invention provide a system for automatically extracting conversational structure from a voice record based on lexical and acoustic features. The system also aggregates business-relevant statistics and entities from a collection of spoken conversations. The system may infer a coarse-level conversational structure based on fine-level activities identified from extracted acoustic features. The system improves significantly over previous systems by extracting structure based on lexical and acoustic features. This enables extracting conversational structure on a larger scale and finer level of detail than previous systems, and can feed an analytics and business intelligence platform, e.g. for customer service phone calls. During operation, the system obtains a voice record. The system then extracts a lexical feature using automatic speech recognition (ASR). The system extracts an acoustic feature. The system then determines, via machine learning and based on the extracted lexical and acoustic features, a coarse-level structure of the conversation.
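
As a rough illustration of the feature-extraction steps described above, the following Python sketch builds one combined feature vector per utterance. It is a minimal sketch under stated assumptions, not the patented implementation: the `Utterance` class, the speaker labels, and the function names are hypothetical; the transcript is assumed to come from an external ASR system; and only timing-based acoustic features are computed, since pitch, intensity, or MFCCs would require the audio signal itself.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Utterance:
    """One segmented utterance (hypothetical structure for illustration)."""
    speaker: str      # e.g. "agent" or "customer"
    start: float      # seconds from the start of the voice record
    end: float
    transcript: str   # text assumed to be produced by an ASR system


def lexical_features(utt: Utterance) -> Dict[str, float]:
    """Bag-of-words counts taken from the ASR transcript."""
    words = utt.transcript.lower().split()
    return {f"word={w}": float(c) for w, c in Counter(words).items()}


def acoustic_features(utt: Utterance, prev: Optional[Utterance]) -> Dict[str, float]:
    """Timing-based features: utterance length, rate, pause, and overlap."""
    duration = utt.end - utt.start
    n_words = len(utt.transcript.split())
    gap = utt.start - prev.end if prev is not None else 0.0
    return {
        "duration": duration,
        "speaking_rate": n_words / duration if duration > 0 else 0.0,
        "pause_before": max(gap, 0.0),        # silence since the previous utterance
        "overlap_with_prev": max(-gap, 0.0),  # time spoken over the previous utterance
    }


def feature_vectors(utterances: List[Utterance]) -> List[Dict[str, float]]:
    """Merge lexical and acoustic features into one dict per utterance."""
    vectors = []
    prev = None
    for utt in utterances:
        feats = lexical_features(utt)
        feats.update(acoustic_features(utt, prev))
        vectors.append(feats)
        prev = utt
    return vectors
```

The resulting dictionaries could then be passed to whatever machine learning method is chosen to label conversational activities; the choice of model is left open here.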

20 Claims

1. A computer-implemented method for extracting structure from a spoken conversation, comprising:
- obtaining, by a computer system comprising a set of processors, a voice record of the spoken conversation;
  
  classifying the voice record into at least three sequential utterances spoken by two different speakers;
  
  extracting a lexical feature from a respective utterance in the voice record using an automatic speech recognition (ASR) method;
  
  extracting a non-verbal acoustic feature from a respective utterance in the voice record; and
  
  determining, via a machine learning method and based on the extracted lexical and acoustic features, a coarse-level conversational structure of the spoken conversation comprising at least a first coarse-level conversational activity associated with two sequential utterances spoken by the two different speakers, and a second coarse-level conversational activity associated with a third utterance.
- Dependent claims 2-9:
  - 2. The method of claim 1:
    - wherein extracting the lexical feature further comprises generating a textual transcript of the spoken conversation;
      
      wherein extracting the acoustic feature further comprises identifying, based on the extracted acoustic feature and the textual transcript, three fine-level conversational activities associated respectively with the three sequential utterances; and
      
      wherein determining the coarse-level conversational structure further comprises:
      
      identifying that two fine-level conversational activities associated with the two sequential utterances by the two speakers are together likely to correspond to the first coarse-level conversational activity; and
      
      identifying that a third fine-level activity associated with the third utterance makes likely a transition to the second coarse-level activity, which is different from the first coarse-level activity.
  - 3. The method of claim 2, wherein the spoken conversation is a customer service conversation, and wherein the likely coarse-level activity comprises one or more of:
    - opening;
      
      detail gathering;
      
      equipment identification;
      
      security questions;
      
      problem articulation;
      
      diagnostics;
      
      fix deployment;
      
      customer satisfaction questions;
      
      hold;
      
      transfer;
      
      pre-closing; and
      
      closing.
  - 4. The method of claim 1, wherein the extracted acoustic feature includes one or more of:
    - speaking pitch;
      
      speaking intensity;
      
      timing or length of an utterance;
      
      timing of silence or pauses;
      
      overlap of utterances;
      
      repetition of phrases, words, or word fragments;
      
      speaking rhythm;
      
      speaking rate;
      
      speaking intonation;
      
      laughter;
      
      a Mel-frequency cepstral coefficient (MFCC); and
      
      a derived acoustic feature.
  - 5. The method of claim 1, further comprising determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, a fine-level activity structure of the spoken conversation.
  - 6. The method of claim 5, further comprising determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, one or more intermediate-level structures of the spoken conversation.
  - 7. The method of claim 5, wherein the fine-level activity structure indicates a fine-level activity including one or more of:
    - an information request;
      
      a clarification request;
      
      a repetition request;
      
      an action request;
      
      pointing;
      
      a future action request;
      
      an alignment request;
      
      a continuer;
      
      a confirmation;
      
      a sequence closer;
      
      a correction;
      
      information provision;
      
      reporting activity status;
      
      waiting;
      
      reporting a future event; and
      
      reciprocity.
  - 8. The method of claim 1:
    - wherein the machine learning method comprises a sequence model; and
      
      wherein determining the coarse-level conversational structure further comprises tracking, by means of the sequence model, a global conversational state within the conversational structure.
  - 9. The method of claim 1, further comprising:
    - computing for a user, via a business intelligence platform, an aggregate statistic, comprising a distribution over activities, categories, and/or entities, from a plurality of conversations comprising the spoken conversation; and
      
      extracting for the user, via the business intelligence platform, targeted information about the spoken conversation.
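
The aggregate statistic recited in claim 9 can be pictured with a short Python sketch. This is only an illustration under assumptions: each conversation is assumed to be already labeled with one coarse-level activity per utterance (labels borrowed from claim 3), and the function names `activity_distribution` and `calls_containing` are hypothetical, not part of any business intelligence platform named in the source.

```python
from collections import Counter
from typing import Dict, List


def activity_distribution(conversations: List[List[str]]) -> Dict[str, float]:
    """Relative frequency of each coarse-level activity across all conversations."""
    counts = Counter(label for conv in conversations for label in conv)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


def calls_containing(conversations: List[List[str]], activity: str) -> List[int]:
    """Indices of conversations in which the given activity occurs at least once."""
    return [i for i, conv in enumerate(conversations) if activity in conv]


# Two made-up calls, one coarse-level label per utterance.
calls = [
    ["opening", "problem articulation", "diagnostics", "closing"],
    ["opening", "hold", "hold", "closing"],
]
dist = activity_distribution(calls)       # e.g. dist["hold"] is 0.25
on_hold = calls_containing(calls, "hold")  # [1]
```

The first function corresponds to a distribution over activities; the second stands in for pulling targeted information about individual conversations.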

10. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for extracting structure from a spoken conversation, the method comprising:
- obtaining a voice record of the spoken conversation;
  
  classifying the voice record into at least three sequential utterances spoken by two different speakers;
  
  extracting a lexical feature from a respective utterance in the voice record using an automatic speech recognition (ASR) method;
  
  extracting a non-verbal acoustic feature from a respective utterance in the voice record; and
  
  determining, via a machine learning method and based on the extracted lexical and acoustic features, a coarse-level conversational structure of the spoken conversation comprising at least a first coarse-level conversational activity associated with two sequential utterances spoken by the two different speakers, and a second coarse-level conversational activity associated with a third utterance.
- Dependent claims 11-15:
  - 11. The non-transitory computer-readable storage medium of claim 10:
    - wherein extracting the lexical feature further comprises generating a textual transcript of the spoken conversation;
      
      wherein extracting the acoustic feature further comprises identifying, based on the extracted acoustic feature and the textual transcript, three fine-level conversational activities associated respectively with the three sequential utterances; and
      
      wherein determining the coarse-level conversational structure further comprises:
      
      identifying that two fine-level conversational activities associated with the two sequential utterances by the two speakers are together likely to correspond to the first coarse-level conversational activity; and
      
      identifying that a third fine-level activity associated with the third utterance makes likely a transition to the second coarse-level activity, which is different from the first coarse-level activity.
  - 12. The non-transitory computer-readable storage medium of claim 10, wherein the extracted acoustic feature includes one or more of:
    - speaking pitch;
      
      speaking intensity;
      
      timing or length of an utterance;
      
      timing of silence or pauses;
      
      overlap of utterances;
      
      repetition of phrases, words, or word fragments;
      
      speaking rhythm;
      
      speaking rate;
      
      speaking intonation;
      
      laughter;
      
      a Mel-frequency cepstral coefficient (MFCC); and
      
      a derived acoustic feature.
  - 13. The non-transitory computer-readable storage medium of claim 10, wherein the method further comprises determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, a fine-level activity structure of the spoken conversation.
  - 14. The non-transitory computer-readable storage medium of claim 13, wherein the method further comprises determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, one or more intermediate-level structures of the spoken conversation.
  - 15. The non-transitory computer-readable storage medium of claim 10:
    - wherein the machine learning method comprises a sequence model; and
      
      wherein determining the coarse-level conversational structure further comprises tracking, by means of the sequence model, a global conversational state within the conversational structure.

16. A computing system for extracting structure from a spoken conversation, the system comprising:
- a set of processors; and
  
  a non-transitory computer-readable medium coupled to the set of processors storing instructions thereon that, when executed by the processors, cause the processors to perform a method for extracting structure from a spoken conversation, the method comprising:
  
  obtaining a voice record of the spoken conversation;
  
  extracting a lexical feature from the voice record using an automatic speech recognition (ASR) method;
  
  extracting an acoustic feature from the voice record; and
  
  determining, via a machine learning method and based on the extracted lexical feature and acoustic feature, a coarse-level conversational structure of the spoken conversation.
- Dependent claims 17-20:
  - 17. The computing system of claim 16:
    - wherein extracting the lexical feature from the voice record further comprises generating a textual transcript of the spoken conversation;
      
      wherein extracting the acoustic feature from the voice record further comprises identifying, based on the extracted acoustic feature and the textual transcript, a fine-level activity corresponding to a portion of the conversation; and
      
      wherein determining the coarse-level conversational structure of the spoken conversation further comprises inferring, based on the identified fine-level activity, a likely coarse-level activity corresponding to the portion of the conversation.
  - 18. The computing system of claim 16, wherein the extracted acoustic feature includes one or more of:
    - speaking pitch;
      
      speaking intensity;
      
      timing or length of an utterance;
      
      timing of silence or pauses;
      
      overlap of utterances;
      
      repetition of phrases, words, or word fragments;
      
      speaking rhythm;
      
      speaking rate;
      
      speaking intonation;
      
      laughter;
      
      a Mel-frequency cepstral coefficient (MFCC); and
      
      a derived acoustic feature.
  - 19. The computing system of claim 16, wherein the method further comprises determining, via the machine learning method and based on the extracted lexical feature and acoustic feature, a fine-level activity structure of the spoken conversation.
  - 20. The computing system of claim 16:
    - wherein the machine learning method comprises a sequence model; and
      
      wherein determining the coarse-level conversational structure further comprises tracking, by means of the sequence model, a global conversational state within the conversational structure.
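
The claims refer to a sequence model that tracks a global conversational state and infers coarse-level activities from fine-level ones. One possible way to picture this is a hidden Markov model decoded with the Viterbi algorithm, sketched below in Python. This is an assumption made for illustration: the claims do not name a specific sequence model, and every probability here is invented rather than learned from data. The state and observation names are taken from the activity lists in the claims.

```python
import math
from typing import Dict, List


def viterbi(observations: List[str], states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Most likely hidden-state sequence for the observations (log-space Viterbi)."""
    if not observations:
        return []

    def lg(p: float) -> float:
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{s: (lg(start_p[s]) + lg(emit_p[s].get(observations[0], 0.0)), None)
             for s in states}]
    for obs in observations[1:]:
        row = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + lg(trans_p[p].get(s, 0.0)), p) for p in states
            )
            row[s] = (score + lg(emit_p[s].get(obs, 0.0)), prev)
        best.append(row)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for row in reversed(best[1:]):
        state = row[state][1]
        path.append(state)
    return path[::-1]


# Hidden states: coarse-level activities. Observations: fine-level activities.
# All numbers below are made up for the example.
STATES = ["opening", "problem articulation", "closing"]
START_P = {"opening": 0.8, "problem articulation": 0.1, "closing": 0.1}
TRANS_P = {
    "opening": {"opening": 0.5, "problem articulation": 0.4, "closing": 0.1},
    "problem articulation": {"opening": 0.1, "problem articulation": 0.6, "closing": 0.3},
    "closing": {"opening": 0.1, "problem articulation": 0.1, "closing": 0.8},
}
EMIT_P = {
    "opening": {"reciprocity": 0.8, "information provision": 0.1, "sequence closer": 0.1},
    "problem articulation": {"reciprocity": 0.1, "information provision": 0.8, "sequence closer": 0.1},
    "closing": {"reciprocity": 0.1, "information provision": 0.1, "sequence closer": 0.8},
}

fine_level = ["reciprocity", "reciprocity", "information provision", "sequence closer"]
coarse = viterbi(fine_level, STATES, START_P, TRANS_P, EMIT_P)
# coarse == ["opening", "opening", "problem articulation", "closing"]
```

In this toy run the first two utterances are grouped into one coarse-level activity and the third triggers a transition to a different one, which mirrors the pattern recited in claim 1.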

Current Assignee
Conduent Business Services, LLC (Conduent, Inc.)
Original Assignee
Palo Alto Research Center, Inc. (Xerox Holdings Corp.)
Inventors
Vig, Jesse; Arsikere, Harish; Szymanski, Margaret H.; Plurkowski, Luke R.; Dent, Kyle D.; Bobrow, Daniel G.; Davies, Daniel; Saund, Eric

Granted Patent

US 10,592,611 B2

CPC Class Codes

G06F 40/284   Lexical analysis, e.g. toke...

G06F 40/35   Discourse or dialogue repre...

G10L 15/26   Speech to text systems G10L...

G10L 25/48   specially adapted for parti...

H04M 2201/40   using speech recognition

H04M 2203/357   Autocues for dialog assistance

H04M 3/51   Centralised call answering ...
