PROSODIC AND LEXICAL ADDRESSEE DETECTION

US 20140214421A1
Filed: 01/31/2013
Published: 07/31/2014
Est. Priority Date: 01/31/2013
Status: Active Grant

Abstract

Prosodic features are used for discriminating computer-directed speech from human-directed speech. Statistics and models describing energy/intensity patterns over time, speech/pause distributions, pitch patterns, vocal effort features, and speech segment duration patterns may be used for prosodic modeling. The prosodic features for at least a portion of an utterance are monitored over a period of time to determine a shape associated with the utterance. A score may be determined to assist in classifying the current utterance as human-directed or computer-directed without relying on knowledge of preceding utterances or utterances following the current utterance. Outside data may be used for training lexical addressee detection systems for the human-human-computer (H-H-C) scenario. Human-computer (H-C) training data can be obtained from a single-user H-C collection, and human-human (H-H) speech can be modeled using general conversational speech. H-C and H-H language models may also be adapted using interpolation with small amounts of matched H-H-C data.

20 Claims

1. A method for addressee detection, comprising:
- receiving an utterance;
- extracting prosodic features directly from a spoken signal corresponding to the utterance;
- determining patterns of the prosodic features that characterize a speaking style of the utterance as either a human-computer (H-C) or a human-human (H-H) style, indicating whether speech is directed to a computer or another human;
- characterizing a degree to which a new utterance conforms to the H-C or H-H styles; and
- using the new utterance according to the characterization of the speaking style.

2. The method of claim 1, wherein the prosodic features that are used to characterize the utterance as the H-C or the H-H are based on temporal and/or spectral patterns comprising at least one of: energy/intensity, pitch, voice quality/vocal effort features, and segmental durations.

3. The method of claim 1, further comprising extracting energy-related features from the utterance using fixed-length temporal windows within the utterance.

4. The method of claim 1, further comprising determining a peak count, a rate, a mean and max distance apart, a mean/max/min/stdev intensity value, and a location and a value for a highest peak of the utterance.

5. The method of claim 1, further comprising determining speech activity information including a speaking rate and duration information for the utterance.

6. The method of claim 1, further comprising determining a waveform duration of the utterance, lengths of initial and final nonspeech regions in the utterance, and a duration of nonspeech regions between words in the utterance.

7. The method of claim 1, further comprising using a lexical model with prosodic features.

8. The method of claim 7, further comprising training the lexical model used in a Human-Human-Computer dialog with out-of-domain data, with or without prosodic features.

9. The method of claim 7, wherein the out-of-domain data is determined from anchor text, with or without prosodic features.

10. The method of claim 7, wherein the lexical model comprises a similarity measure between the spoken words and display text.
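
The energy-related measurements recited in claims 3 and 4 can be illustrated with a small sketch. This is not the patented implementation: the function names `window_energies` and `peak_features`, the use of mean-square energy, the non-overlapping windows, the local-maximum definition of a peak, and the reading of "rate" as peaks per second are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def window_energies(samples, window_len):
    """Mean-square energy of each fixed-length, non-overlapping window.

    Trailing samples that do not fill a whole window are ignored
    (an assumption; the claims do not say how partial windows are handled).
    """
    energies = []
    for start in range(0, len(samples) - window_len + 1, window_len):
        window = samples[start:start + window_len]
        energies.append(sum(x * x for x in window) / window_len)
    return energies

def peak_features(energies, window_sec):
    """Summary statistics over local maxima of an energy contour.

    A peak is assumed to be a window whose energy is strictly greater than
    the previous window and at least as large as the next one.
    """
    if not energies:
        return {"peak_count": 0}
    peaks = [i for i in range(1, len(energies) - 1)
             if energies[i - 1] < energies[i] >= energies[i + 1]]
    # Distances between consecutive peaks, in seconds.
    gaps = [(b - a) * window_sec for a, b in zip(peaks, peaks[1:])]
    top = max(peaks, key=lambda i: energies[i]) if peaks else None
    return {
        "peak_count": len(peaks),
        "peak_rate": len(peaks) / (len(energies) * window_sec),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0.0,
        "mean_intensity": mean(energies),
        "max_intensity": max(energies),
        "min_intensity": min(energies),
        "stdev_intensity": pstdev(energies),
        "top_peak_location": top * window_sec if top is not None else None,
        "top_peak_value": energies[top] if top is not None else None,
    }
```

In practice the contour would come from `window_energies` applied to the audio samples of one utterance, with `window_sec` set to the window length in seconds.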

11. A computer-readable medium storing computer-executable instructions for addressee detection, comprising:
- receiving an utterance;
- extracting prosodic features directly from a spoken signal corresponding to the utterance;
- determining patterns of the prosodic features that characterize a speaking style of the utterance as either a human-computer (H-C) or a human-human (H-H) style, indicating whether speech is directed to a computer or another human;
- characterizing a degree to which a new utterance conforms to the H-C or H-H styles; and
- using the new utterance according to the characterization of the speaking style.

12. The computer-readable medium of claim 11, wherein the prosodic features that are used to characterize the utterance as the H-C or the H-H are based on temporal and/or spectral patterns comprising at least one of: energy/intensity, pitch, voice quality/vocal effort features, and segmental durations.

13. The computer-readable medium of claim 11, further comprising extracting energy-related features from the utterance using fixed-length temporal windows within the utterance.

14. The computer-readable medium of claim 11, further comprising determining a peak count, a rate, a mean and max distance apart, a mean/max/min/stdev intensity value, and a location and a value for a highest peak of the utterance.

15. The computer-readable medium of claim 11, further comprising determining speech activity information including a speaking rate and duration information for the utterance.

16. The computer-readable medium of claim 11, further comprising determining a waveform duration of the utterance, lengths of initial and final nonspeech regions in the utterance, and a duration of nonspeech regions between words in the utterance.
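
The speech-activity and duration measurements named in claims 15 and 16 could be derived from word-level timing, as in the following minimal sketch. The input format (a sorted list of `(start_sec, end_sec)` pairs per word, for example from a recognizer's time alignment), the function name `duration_features`, and the definition of speaking rate as words per second between the first and last word are assumptions, not details taken from the claims.

```python
def duration_features(word_times, waveform_sec):
    """Duration features for one utterance.

    word_times   -- hypothetical list of (start_sec, end_sec), sorted by start
    waveform_sec -- total waveform duration in seconds
    """
    if not word_times:
        # No words detected: treat the whole waveform as initial nonspeech.
        return {
            "waveform_sec": waveform_sec,
            "initial_nonspeech": waveform_sec,
            "final_nonspeech": 0.0,
            "inter_word_nonspeech": 0.0,
            "speaking_rate": 0.0,
        }
    first_start = word_times[0][0]
    last_end = word_times[-1][1]
    # Nonspeech between consecutive words; overlaps are clamped to zero.
    pauses = [max(0.0, nxt[0] - cur[1])
              for cur, nxt in zip(word_times, word_times[1:])]
    spoken_span = last_end - first_start
    return {
        "waveform_sec": waveform_sec,
        "initial_nonspeech": first_start,
        "final_nonspeech": waveform_sec - last_end,
        "inter_word_nonspeech": sum(pauses),
        "speaking_rate": len(word_times) / spoken_span if spoken_span > 0 else 0.0,
    }
```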

17. A system for addressee detection, comprising:
- a processor and memory;
- an operating environment executing using the processor;
- a display; and
- an addressee manager that is configured to perform actions comprising:
  - receiving an utterance;
  - extracting prosodic features directly from a spoken signal corresponding to the utterance;
  - determining patterns of the prosodic features characterizing a speaking style of the utterance as directed toward a computer when the patterns indicate the utterance is directed to a computer;
  - characterizing the speaking style of the utterance as directed toward a human when the patterns indicate the utterance is directed to a human;
  - using a new utterance according to the characterization of the speaking style; and
  - using a language model that is trained using a combination of out-of-domain data and in-domain data.

18. The system of claim 17, further comprising extracting energy-related features from the utterance using fixed-length temporal windows within the utterance.

19. The system of claim 17, further comprising determining a peak count, a rate, a mean and max distance apart, a mean/max/min/stdev intensity value, and a location and a value for a highest peak of the utterance.

20. The system of claim 17, further comprising determining speech activity information including a speaking rate and duration information for the utterance and determining a waveform duration of the utterance, lengths of initial and final nonspeech regions in the utterance, and a duration of nonspeech regions between words in the utterance.
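
Claim 17 recites a language model trained on a combination of out-of-domain and in-domain data, and the abstract mentions adapting the H-C and H-H language models by interpolation. One way this could look is sketched below. Everything specific here is an assumption: the add-one-smoothed unigram models, the linear interpolation weight, the length-normalized log-likelihood ratio used as a score, and the function names `unigram_model`, `interpolate`, and `addressee_score`. The record does not state the model order, smoothing, or score that the inventors used.

```python
import math
from collections import Counter

def unigram_model(sentences, vocab):
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def interpolate(out_domain, in_domain, weight):
    """Linear interpolation; `weight` is the share given to in-domain data."""
    return {w: weight * in_domain[w] + (1 - weight) * out_domain[w]
            for w in out_domain}

def addressee_score(words, hc_model, hh_model):
    """Length-normalized log-likelihood ratio of H-C versus H-H.

    Positive values favor computer-directed speech and negative values
    favor human-directed speech. Out-of-vocabulary words are skipped.
    """
    known = [w for w in words if w in hc_model]
    if not known:
        return 0.0
    return sum(math.log(hc_model[w] / hh_model[w]) for w in known) / len(known)
```

A typical use would build one interpolated model per class, for example `interpolate(unigram_model(hc_out, vocab), unigram_model(hc_in, vocab), 0.5)`, and then score each new utterance against the H-C and H-H pair.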

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Shriberg, Elizabeth; Stolcke, Andreas; Hakkani-Tur, Dilek; Heck, Larry; Lee, Heeyoung

Granted Patent

US 9,761,247 B2
US Class Current

704/243
CPC Class Codes

G06F 40/284   Lexical analysis, e.g. toke...

G10L 15/063   Training

G10L 15/1807   using prosody or stress

G10L 15/183   using context dependencies,...

G10L 15/22   Procedures used during a sp...

G10L 25/03   characterised by the type o...

G10L 25/18   the extracted parameters be...

G10L 25/51   for comparison or discrimin...

G10L 25/60   for measuring the quality o...

G10L 25/87   Detection of discrete point...

G10L 25/90   Pitch determination of spee...
