Goal segmentation in speech dialogs

US 10,236,017 B1
Filed: 09/29/2015
Issued: 03/19/2019
Est. Priority Date: 09/29/2015
Status: Active Grant

1 Assignment

0 Petitions

Abstract

A speech-based system is configured to interact with a user through speech to determine intents and goals of the user. The system may analyze multiple dialog turns in order to determine and fully define a goal that the user is trying to express. Each dialog turn comprises a user utterance. Each dialog turn may also comprise a system speech response. In order to evaluate the performance of the system, logged data is analyzed to identify goal segments within the logged data, where a goal segment is a sequence of dialog turns that relate to a corresponding user goal. A subset of the dialog turns is annotated manually to delineate goal segments. A predictive model is then constructed based on the manually annotated goal segments. The predictive model is then used to identify goal segments formed by additional dialog turns.
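The train-then-apply workflow in the abstract (annotate a subset of logged turns, build a predictive model, label the remaining turns) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it swaps in a plain logistic-regression classifier that scores each turn independently (claim 17 names a conditional random fields model instead), and the function names `train_boundary_model` and `predict_boundaries`, the numeric feature layout, and the label convention (1 = the turn begins a new goal segment) are all hypothetical.

```python
import math
from typing import List, Sequence


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def _score(weights: Sequence[float], x: Sequence[float]) -> float:
    # zip stops at len(x), so the trailing bias weight is added separately.
    return sum(w * xi for w, xi in zip(weights, x)) + weights[-1]


def train_boundary_model(features: Sequence[Sequence[float]],
                         labels: Sequence[int],
                         epochs: int = 1000,
                         lr: float = 0.5) -> List[float]:
    """Fit logistic-regression weights (last entry = bias) by batch gradient descent.

    Hypothetical convention: labels[i] is 1 if annotated turn i begins a goal
    segment, else 0. Features are assumed to be roughly unit-scaled.
    """
    n_feats = len(features[0])
    weights = [0.0] * (n_feats + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_feats + 1)
        for x, y in zip(features, labels):
            err = _sigmoid(_score(weights, x)) - y  # d(log-loss)/d(score)
            for j, xi in enumerate(x):
                grad[j] += err * xi
            grad[-1] += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad)]
    return weights


def predict_boundaries(weights: Sequence[float],
                       features: Sequence[Sequence[float]],
                       threshold: float = 0.5) -> List[int]:
    """Label unannotated turns: 1 = predicted goal boundary, 0 = continuation."""
    return [1 if _sigmoid(_score(weights, x)) >= threshold else 0 for x in features]
```

For instance, with a single hypothetical feature such as minutes since the previous turn, the model learns from a handful of annotated turns that long pauses tend to start a new goal and then applies that to unannotated turns. A per-turn classifier ignores label dependencies between neighboring turns, which is one reason a sequence model such as a CRF is a natural fit here.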

51 Citations

20 Claims

1. A method comprising:
- processing a first dialog turn comprising a user utterance, wherein processing the first dialog turn comprises:
  
  performing automated speech recognition (ASR) on the user utterance to produce an ASR hypothesis corresponding to the first dialog turn; and
  
  performing natural language understanding (NLU) on the ASR hypothesis to produce an NLU hypothesis corresponding to the first dialog turn;
  
  determining a first feature set corresponding to the first dialog turn, wherein the first feature set corresponding to the first dialog turn indicates:
  
  a time relationship between the first dialog turn and a second dialog turn;
  
  a confidence level of the ASR hypothesis corresponding to the first dialog turn; and
  
  a confidence level of the NLU hypothesis corresponding to the first dialog turn;
  
  determining a second feature set corresponding to the second dialog turn; and
  
  receiving a first goal boundary identifier, wherein the first goal boundary identifier delineates a first goal segment comprising a sequence of one or more dialog turns that relate to a corresponding goal of the user utterance;
  
  analyzing one or more dependencies between the goal boundary identifier and the first feature set;
  
  generating a predictive model based at least in part on the analyzing, wherein the predictive model is responsive to a third feature set corresponding to a third dialog turn to delineate a second goal segment;
  
  determining, based at least in part on the predictive model and the third feature set, the second goal segment;
  
  associating, with the second goal segment, a start label that indicates a beginning of the second goal segment; and
  
  associating, with the second goal segment, an end label that indicates an end of the second goal segment.
- Dependent Claims (2, 3):
  - 2. The method of claim 1, wherein determining the first feature set comprises analyzing logged data associated with multiple dialog turns.
  - 3. The method of claim 1, wherein the first feature set corresponding to the first dialog turn comprises a feature from a fourth dialog turn that precedes the first dialog turn and a fifth dialog turn that follows the first dialog turn.
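The feature set recited in claims 1 and 3 (a time relationship between turns, ASR and NLU confidence levels, and features drawn from the preceding and following turns) might be assembled roughly as below. This is a sketch under assumptions: the `DialogTurn` record, its field names, the feature keys, and the choice of "gap to the previous turn" as the time relationship are all hypothetical, not taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DialogTurn:
    """Hypothetical logged dialog turn; the field names are illustrative only."""
    start_time: float      # seconds since the start of the session
    asr_confidence: float  # confidence of the top ASR hypothesis
    nlu_confidence: float  # confidence of the top NLU hypothesis
    nlu_intent: str        # intent label of the top NLU hypothesis


def turn_features(turns: List[DialogTurn], i: int) -> Dict[str, float]:
    """Feature set for turn i, including context from the adjacent turns."""
    turn = turns[i]
    feats: Dict[str, float] = {
        "asr_conf": turn.asr_confidence,
        "nlu_conf": turn.nlu_confidence,
        # Time relationship to the previous turn; 0.0 when there is none.
        "gap_prev": turn.start_time - turns[i - 1].start_time if i > 0 else 0.0,
    }
    # Same keys for the preceding and following turn, zero-filled at the
    # edges of the session so every turn yields an identical key set.
    for offset, name in ((-1, "prev"), (1, "next")):
        j = i + offset
        other: Optional[DialogTurn] = turns[j] if 0 <= j < len(turns) else None
        feats[f"{name}_present"] = 0.0 if other is None else 1.0
        feats[f"{name}_asr_conf"] = 0.0 if other is None else other.asr_confidence
        feats[f"{name}_nlu_conf"] = 0.0 if other is None else other.nlu_confidence
        feats[f"{name}_same_intent"] = (
            0.0 if other is None else float(other.nlu_intent == turn.nlu_intent)
        )
    return feats
```

Keeping an identical key set for every turn (with explicit `*_present` flags) lets the same model consume the first and last turns of a session without special cases.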

4. A system comprising:
- one or more processors; and
  
  one or more non-transitory computer-readable media storing computer-executable instructions that, when executed on the one or more processors, cause the one or more processors to perform actions comprising:
  
  determining a first feature set corresponding to a first dialog turn that is stored in a set of logged data, the first dialog turn corresponding to a first user utterance;
  
  receiving a first goal boundary identifier indicating that the first dialog turn represents a boundary of a first goal segment, the first goal segment comprising a first sequence of one or more dialog turns that relate to a first user goal;
  
  generating a predictive model based at least in part on the first feature set and the first goal boundary identifier;
  
  determining a second feature set of a second dialog turn corresponding to a second user utterance;
  
  analyzing the second feature set using the predictive model;
  
  determining, based at least in part on the analyzing, that the second dialog turn represents a boundary of a second goal segment, the second goal segment comprising a second sequence of one or more dialog turns that relate to a second user goal; and
  
  storing a second goal boundary identifier indicating that the second dialog turn represents the boundary of the second goal segment.
- Dependent Claims (5, 6, 7, 8, 9, 10, 11, 12):
  - 5. The system of claim 4, wherein determining the first feature set comprises determining the first feature set based at least in part on at least one of a third dialog turn that precedes the first dialog turn or a fourth dialog turn that follows the first dialog turn.
  - 6. The system of claim 4, wherein the first feature set comprises a confidence level of an automated speech recognition (ASR) hypothesis corresponding to the first dialog turn.
  - 7. The system of claim 4, wherein the first feature set comprises automated speech recognition (ASR) metadata corresponding to the first dialog turn and ASR metadata corresponding to a third dialog turn.
  - 8. The system of claim 4, wherein the first feature set comprises natural language understanding (NLU) metadata corresponding to the first dialog turn and NLU metadata corresponding to a third dialog turn.
  - 9. The system of claim 4, wherein the first feature set comprises one or more natural language understanding (NLU) hypotheses corresponding to the first dialog turn and one or more NLU hypotheses corresponding to a third dialog turn.
  - 10. The system of claim 4, wherein the first feature set comprises text corresponding to a first system response that is part of the first dialog turn and text corresponding to a second system response that is part of a third dialog turn.
  - 11. The system of claim 4, wherein the first feature set indicates whether a wake-word prefaced the first dialog turn.
  - 12. The system of claim 4, wherein the first feature set indicates an amount of time between the first dialog turn and a third dialog turn.
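Claim 4 stores goal boundary identifiers per dialog turn, and claim 1 attaches start and end labels to each goal segment. One hypothetical representation, assumed here rather than taken from the patent, treats boundary identifiers as the indices of turns that begin a goal and derives inclusive segment ranges plus per-turn labels from them. The helper names and the `START`/`END`/`START_END`/`INSIDE` label strings are illustrative.

```python
from typing import Iterable, List, Tuple


def segments_from_boundaries(num_turns: int, starts: Iterable[int]) -> List[Tuple[int, int]]:
    """Convert goal-start turn indices into inclusive (first_turn, last_turn) ranges.

    Turn 0 is always treated as a segment start; out-of-range indices are ignored.
    """
    if num_turns <= 0:
        return []
    begins = sorted({0} | {s for s in starts if 0 <= s < num_turns})
    segments = []
    for k, begin in enumerate(begins):
        # A segment ends just before the next start, or at the last turn.
        end = begins[k + 1] - 1 if k + 1 < len(begins) else num_turns - 1
        segments.append((begin, end))
    return segments


def label_turns(num_turns: int, segments: List[Tuple[int, int]]) -> List[str]:
    """Attach START/END labels to segment edges; single-turn segments get START_END."""
    labels = ["INSIDE"] * num_turns
    for begin, end in segments:
        labels[end] = "END"
        labels[begin] = "START_END" if begin == end else "START"
    return labels
```

With six turns and goal starts at turns 2 and 5, this yields the segments (0, 1), (2, 4), and (5, 5), and the last one is a single-turn goal.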

13. A method comprising:
- building a predictive model based at least in part on training data comprising at least (a) a first feature set corresponding to a first dialog turn and (b) a first goal boundary identifier indicating that the first dialog turn represents a boundary of a first goal segment, the first goal segment comprising a first sequence of one or more dialog turns that relate to a first user goal;
  
  receiving a second feature set corresponding to a second dialog turn;
  
  analyzing the second feature set using the predictive model; and
  
  determining, based at least in part on the analyzing, that the second dialog turn represents a boundary of a second goal segment comprising a sequence of one or more dialog turns that relate to a corresponding second user goal.
- Dependent Claims (14, 15, 16, 17, 18, 19):
  - 14. The method of claim 13, wherein determining that the second dialog turn represents a boundary of the second goal segment comprises determining that the second dialog turn represents the second goal segment based at least in part on a third feature set of at least one of a third dialog turn that precedes the second dialog turn or a fourth dialog turn that follows the second dialog turn.
  - 15. The method of claim 13, wherein the first feature set comprises one or more features of a third dialog turn that precedes the first dialog turn and one or more features of a fourth dialog turn that follows the first dialog turn.
  - 16. The method of claim 13, wherein the first goal boundary identifier comprises a human annotation.
  - 17. The method of claim 13, wherein the predictive model comprises a conditional random fields model.
  - 18. The method of claim 13, wherein the first feature set includes data produced by automatic speech recognition (ASR).
  - 19. The method of claim 13, wherein the first feature set includes data produced by natural language understanding (NLU).
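Claim 17 names a conditional random fields model. The decoding step of a linear-chain CRF, which picks the best label sequence over a whole session rather than scoring each turn alone, can be sketched as a Viterbi search. Everything specific here is an assumption for illustration: the two-label scheme (`B` = turn begins a goal segment, `I` = turn continues it), the helpers `emission_scores` and `viterbi`, and all weights. Learning the weights (for example, by maximizing conditional log-likelihood on human-annotated turns) is not shown.

```python
from typing import Dict, List, Sequence, Tuple

LABELS = ("B", "I")  # B: turn begins a goal segment; I: turn continues the current one


def emission_scores(features: Sequence[Dict[str, float]],
                    weights: Dict[str, Dict[str, float]]) -> List[Dict[str, float]]:
    """Per-turn, per-label score: dot product of label weights and turn features."""
    return [{y: sum(weights[y].get(k, 0.0) * v for k, v in f.items()) for y in LABELS}
            for f in features]


def viterbi(emissions: Sequence[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            start_scores: Dict[str, float]) -> List[str]:
    """Return the label sequence maximizing
    start[y_0] + sum_t emit_t[y_t] + sum_{t>0} trans[(y_{t-1}, y_t)].
    """
    if not emissions:
        return []
    best = {y: start_scores[y] + emissions[0][y] for y in LABELS}
    backptrs: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_best: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for y in LABELS:
            # Best previous label given that the current label is y.
            prev = max(LABELS, key=lambda p: best[p] + transitions[(p, y)])
            new_best[y] = best[prev] + transitions[(prev, y)] + emit[y]
            ptr[y] = prev
        best = new_best
        backptrs.append(ptr)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda y: best[y])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Each `B` in the decoded sequence can then be stored as a goal boundary identifier for that turn. A full CRF also normalizes these scores into a conditional distribution over label sequences; only the argmax decoding used to delineate segments is sketched here.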

20. A system comprising:
- one or more processors; and
  
  one or more non-transitory computer-readable media storing computer-executable instructions that, when executed on the one or more processors, cause the one or more processors to perform actions comprising:
  
  generating a predictive model using a first feature set of a first dialog turn and an indication that the first dialog turn represents a first boundary of a first goal segment;
  
  determining a second feature set of a second dialog turn corresponding to a user utterance;
  
  analyzing the second feature set using the predictive model;
  
  determining, based at least in part on the analyzing, that the second dialog turn represents a second boundary of a second goal segment, the second goal segment comprising a sequence of one or more dialog turns that relate to a user goal; and
  
  storing an indication that the second dialog turn represents the second boundary of the second goal segment.

Current Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Original Assignee
Amazon Technologies, Inc. (Amazon.com, Inc.)
Inventors
Witt-Ehsani, Silke, Di Fabbrizio, Giuseppe Pino
Primary Examiner(s)
Ortiz-Sanchez, Michael

Application Number
US14/869,797
Time in Patent Office
1,267 Days
CPC Class Codes

G06F 40/216   using statistical methods

G06F 40/35   Discourse or dialogue repre...

G10L 15/22   Procedures used during a sp...

G10L 15/26   Speech to text systems G10L...