Segmentation approach for speech recognition systems

US 6,535,851 B1
Filed: 03/24/2000
Issued: 03/18/2003
Est. Priority Date: 03/24/2000
Status: Expired due to Term

5 Assignments

0 Petitions

Abstract

Phonetic units are identified in a body of utterance data according to a novel segmentation approach. A body of received utterance data is processed and a set of candidate phonetic unit boundaries is determined that defines a set of candidate phonetic units. The set of candidate phonetic unit boundaries is determined based upon changes in Cepstral coefficient values, changes in utterance energy, changes in phonetic classification, broad category analysis (retroflex, back vowels, front vowels) and sonorant onset detection. The set of candidate phonetic unit boundaries is filtered by priority and proximity to other candidate phonetic units and by silence regions. The set of candidate phonetic units is filtered using no-cross region analysis to generate a set of filtered candidate phonetic units. No-cross region analysis generally involves discarding candidate phonetic units that completely span an energy up, energy down, dip or broad category type no-cross region. Finally, a set of phonetic units is selected from the set of filtered candidate phonetic units based upon the probabilities of candidate boundaries defining the ends of the unit and within the unit.
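
The no-cross filtering step described in the abstract can be illustrated with a minimal sketch. It assumes, for illustration only, that boundaries are integer frame indices, that every ordered pair of candidate boundaries forms a candidate unit, and that a unit is discarded when it completely spans a no-cross region. The names `spans_region` and `filter_candidates` are hypothetical and do not come from the patent.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int]  # (start_frame, end_frame); an assumed representation


def spans_region(unit: Span, region: Span) -> bool:
    """True if the candidate unit completely covers the no-cross region."""
    return unit[0] <= region[0] and unit[1] >= region[1]


def filter_candidates(units: List[Span], no_cross: List[Span]) -> List[Span]:
    """Keep only candidate units that do not completely span any no-cross region."""
    return [u for u in units if not any(spans_region(u, r) for r in no_cross)]


# Assumption: each pair of candidate boundaries defines one candidate unit.
boundaries = [0, 10, 15, 25]
candidates = list(combinations(boundaries, 2))

# One hypothetical no-cross region covering frames 12 to 18.
regions = [(12, 18)]

filtered = filter_candidates(candidates, regions)
print(filtered)  # [(0, 10), (0, 15), (10, 15), (15, 25)]
```

In this example the units (0, 25) and (10, 25) are dropped because each covers the whole region from frame 12 to frame 18, while units that end or start at the boundary inside the region (frame 15) survive.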

45 Claims

1. A method for automatically determining a set of phonetic units from a body of utterance data, the method comprising the computer-implemented steps of:
- receiving the body of utterance data;
- determining a first set of candidate phonetic units from the body of utterance data;
- determining a set of no-cross regions from the body of utterance data wherein the no-cross regions correspond to a time span of utterance data having a high probability of containing a boundary between phonetic units;
- filtering the first set of candidate phonetic units to generate a subset of candidate phonetic units therefrom wherein the filtering analyzes the candidate phonetic units to determine if the candidate spans a no-cross region for the utterance data such that the subset omits candidate phonetic units which spanned a no-cross region.
- Dependent claims 2-15:
  - 2. The method as recited in claim 1, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed Cepstral change measure.
  - 3. The method as recited in claim 1, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed energy change measure.
  - 4. The method as recited in claim 1, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed phonetic classification measure.
  - 5. The method as recited in claim 1, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying the presence of regions with a high probability of belonging to some broad phonetic category and adding boundaries at the edges.
  - 6. The method as recited in claim 1, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes detecting the onset of a sonorant and adding an additional boundary to account for possible voiced stops.
  - 7. The method as recited in claim 1, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes discarding one or more boundaries that are within a specified proximity to at least one other boundary having a higher priority.
  - 8. The method as recited in claim 1, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes discarding one or more boundaries that are determined to be contained in a silence region.
  - 9. The method as recited in claim 1, wherein the step of filtering the set of candidate phonetic units using no-cross region analysis includes discarding one or more candidate phonetic units from the set of candidate phonetic units that completely span at least one no-cross region from the set of one or more no-cross regions.
  - 10. The method as recited in claim 9, wherein identifying a set of one or more no-cross regions includes identifying a change in utterance energy that satisfies specified no-cross region criteria.
  - 11. The method as recited in claim 10, wherein the specified no-cross region criteria includes a minimum increase in utterance energy and identifying a set of one or more no-cross regions includes identifying an increase in utterance energy that exceeds the minimum increase in utterance energy.
  - 12. The method as recited in claim 11, wherein the specified no-cross region criteria includes a minimum decrease in utterance energy and identifying a set of one or more no-cross regions includes identifying a decrease in utterance energy that exceeds the minimum decrease in utterance energy.
  - 13. The method as recited in claim 11, wherein the specified no-cross region criteria includes dip no-cross region criteria and identifying a set of one or more no-cross regions includes identifying a region of utterance data with a dip in energy that satisfies the dip no-cross region criteria.
  - 14. The method as recited in claim 11, wherein the specified no-cross region criteria includes broad category change no-cross region criteria and identifying a set of one or more no-cross regions includes identifying a region of utterance data over which the broad phonetic classification changed sufficiently to satisfy the broad category no-cross region criteria.
  - 15. The method as recited in claim 1, wherein the step of selecting the set of phonetic units from the set of filtered candidate phonetic units includes selecting the set of N number of phonetic units having the relatively highest probability of boundaries at the ends and relatively lowest probability of boundaries internal to the unit.
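
Claims 2 to 4 describe finding candidate boundaries at peaks of a smoothed change measure. The following is a minimal sketch of that idea applied to a per-frame energy track. The absolute frame-to-frame difference, the centered moving average, the window width, and the threshold are all illustrative assumptions; the claims do not specify how the measure is computed or smoothed.

```python
from typing import List, Sequence


def change_measure(frames: Sequence[float]) -> List[float]:
    """Absolute frame-to-frame difference (an assumed stand-in for a change measure)."""
    return [0.0] + [abs(b - a) for a, b in zip(frames, frames[1:])]


def moving_average(values: Sequence[float], width: int = 3) -> List[float]:
    """Centered moving average; the window is truncated at both ends."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def find_peaks(values: Sequence[float], threshold: float) -> List[int]:
    """Indices of local maxima that exceed the threshold."""
    return [
        i
        for i in range(1, len(values) - 1)
        if values[i] > threshold and values[i - 1] < values[i] >= values[i + 1]
    ]


# Hypothetical energy track: a rise around frame 4 and a fall around frame 9.
energy = [1, 1, 1, 2, 4, 5, 5, 5, 4, 2, 1, 1]

peaks = find_peaks(moving_average(change_measure(energy)), threshold=0.5)
print(peaks)  # [4, 9]
```

The same peak picker could be run on a Cepstral distance or a phonetic classification change track; only `change_measure` would need to be replaced.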

16. A computer-readable medium carrying one or more sequences or one or more instructions for automatically determining a set of phonetic units from a body of utterance data, the one or more sequences or one or more instructions including instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of:
- receiving the body of utterance data;
- determining a first set of candidate phonetic units from the body of utterance data;
- determining a set of no-cross regions from the body of utterance data wherein the no-cross regions correspond to a time span of utterance data having a high probability of containing a boundary between phonetic units;
- filtering the first set of candidate phonetic units to generate a subset of candidate phonetic units therefrom wherein the filtering analyzes the candidate phonetic units to determine if the candidate spans a no-cross region for the utterance data such that the subset omits candidate phonetic units which spanned a no-cross region.
- Dependent claims 17-30:
  - 17. The computer-readable medium as recited in claim 16, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed Cepstral change measure.
  - 18. The computer-readable medium as recited in claim 16, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed energy change measure.
  - 19. The computer-readable medium as recited in claim 16, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed phonetic classification measure.
  - 20. The computer-readable medium as recited in claim 16, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying the presence of regions with a high probability of belonging to some broad phonetic category and adding boundaries at the edges.
  - 21. The computer-readable medium as recited in claim 16, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes detecting the onset of a sonorant and adding an additional boundary to account for possible voiced stops.
  - 22. The computer-readable medium as recited in claim 16, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes discarding one or more boundaries that are within a specified proximity to at least one other boundary having a higher priority.
  - 23. The computer-readable medium as recited in claim 16, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes discarding one or more boundaries that are determined to be contained in a silence region.
  - 24. The computer-readable medium as recited in claim 16, wherein the step of filtering the set of candidate phonetic units using no-cross region analysis includes discarding one or more candidate phonetic units from the set of candidate phonetic units that completely span at least one no-cross region from the set of one or more no-cross regions.
  - 25. The computer-readable medium as recited in claim 24, wherein identifying a set of one or more no-cross regions includes identifying a change in utterance energy that satisfies specified no-cross region criteria.
  - 26. The computer-readable medium as recited in claim 25, wherein the specified no-cross region criteria includes a minimum increase in utterance energy and identifying a set of one or more no-cross regions includes identifying an increase in utterance energy that exceeds the minimum increase in utterance energy.
  - 27. The computer-readable medium as recited in claim 26, wherein the specified no-cross region criteria includes a minimum decrease in utterance energy and identifying a set of one or more no-cross regions includes identifying a decrease in utterance energy that exceeds the minimum decrease in utterance energy.
  - 28. The computer-readable medium as recited in claim 26, wherein the specified no-cross region criteria includes dip no-cross region criteria and identifying a set of one or more no-cross regions includes identifying a region of utterance data with a dip in energy that satisfies the dip no-cross region criteria.
  - 29. The computer-readable medium as recited in claim 26, wherein the specified no-cross region criteria includes broad category change no-cross region criteria and identifying a set of one or more no-cross regions includes identifying a region of utterance data over which the broad phonetic classification changed sufficiently to satisfy the broad category no-cross region criteria.
  - 30. The computer-readable medium as recited in claim 16, wherein the step of selecting the set of phonetic units from the set of filtered candidate phonetic units includes selecting the set of N number of phonetic units having the relatively highest probability of boundaries at the ends and relatively lowest probability of boundaries internal to the unit.
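
Claims 22 and 23 (like claims 7 and 8) describe discarding candidate boundaries that lie close to a higher-priority boundary or inside a silence region. The sketch below shows one way this could look. The `Boundary` class, the integer priority scale (larger means more trusted), the `min_gap` parameter, and the choice to treat the edges of a silence region as outside it are all assumptions made for illustration, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Boundary:
    frame: int     # position of the candidate boundary, in frames
    priority: int  # larger value means more trusted (assumed convention)


def drop_near_higher_priority(bounds: List[Boundary], min_gap: int) -> List[Boundary]:
    """Discard boundaries within min_gap frames of a higher-priority boundary."""
    return [
        b
        for b in bounds
        if not any(
            o.priority > b.priority and abs(o.frame - b.frame) <= min_gap
            for o in bounds
        )
    ]


def drop_in_silence(bounds: List[Boundary], silences: List[Tuple[int, int]]) -> List[Boundary]:
    """Discard boundaries strictly inside a (start, end) silence region."""
    return [
        b
        for b in bounds
        if not any(start < b.frame < end for start, end in silences)
    ]


candidates = [
    Boundary(10, 3),
    Boundary(12, 1),  # within 3 frames of the higher-priority boundary at 10
    Boundary(30, 2),
    Boundary(50, 1),  # inside the silence region below
    Boundary(55, 1),  # inside the silence region below
]

kept = drop_in_silence(drop_near_higher_priority(candidates, min_gap=3), [(48, 60)])
print([b.frame for b in kept])  # [10, 30]
```

One simplification to note: the proximity check compares each boundary against the full original list, so a boundary can be suppressed by a neighbor that is itself later discarded.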

31. A speech recognition system for automatically determining a set of phonetic units from a body of utterance data, the speech recognition system comprising:
- one or more processors; and
- a memory communicatively coupled to the one or more processors, wherein the memory includes one or more sequences or one or more instructions which, when executed by the one or more processors, cause the one or more processors to perform the steps of:
  - receiving the body of utterance data;
  - determining a first set of candidate phonetic units from the body of utterance data;
  - determining a set of no-cross regions from the body of utterance data wherein the no-cross regions correspond to a time span of utterance data having a high probability of containing a boundary between phonetic units;
  - filtering the first set of candidate phonetic units to generate a subset of candidate phonetic units therefrom wherein the filtering analyzes the candidate phonetic units to determine if the candidate spans a no-cross region for the utterance data such that the subset omits candidate phonetic units which spanned a no-cross region.
- Dependent claims 32-45:
  - 32. The speech recognition system as recited in claim 31, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed Cepstral change measure.
  - 33. The speech recognition system as recited in claim 31, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed energy change measure.
  - 34. The speech recognition system as recited in claim 31, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying peaks in a smoothed phonetic classification measure.
  - 35. The speech recognition system as recited in claim 31, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes identifying the presence of regions with a high probability of belonging to some broad phonetic category and adding boundaries at the edges.
  - 36. The speech recognition system as recited in claim 31, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes detecting the onset of a sonorant and adding an additional boundary to account for possible voiced stops.
  - 37. The speech recognition system as recited in claim 31, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes discarding one or more boundaries that are within a specified proximity to at least one other boundary having a higher priority.
  - 38. The speech recognition system as recited in claim 31, wherein the step of determining a set of candidate phonetic units from the body of utterance data includes discarding one or more boundaries that are determined to be contained in a silence region.
  - 39. The speech recognition system as recited in claim 31, wherein the step of filtering the set of candidate phonetic units using no-cross region analysis includes identifying a set of one or more no-cross regions defined by the body of utterance data and discarding one or more candidate phonetic units from the set of candidate phonetic units that completely span at least one no-cross region from the set of one or more no-cross regions.
  - 40. The speech recognition system as recited in claim 39, wherein identifying a set of one or more no-cross regions includes identifying a change in utterance energy that satisfies specified no-cross region criteria.
  - 41. The speech recognition system as recited in claim 40, wherein the specified no-cross region criteria includes a minimum increase in utterance energy and identifying a set of one or more no-cross regions includes identifying an increase in utterance energy that exceeds the minimum increase in utterance energy.
  - 42. The speech recognition system as recited in claim 41, wherein the specified no-cross region criteria includes a minimum decrease in utterance energy and identifying a set of one or more no-cross regions includes identifying a decrease in utterance energy that exceeds the minimum decrease in utterance energy.
  - 43. The speech recognition system as recited in claim 41, wherein the specified no-cross region criteria includes dip no-cross region criteria and identifying a set of one or more no-cross regions includes identifying a region of utterance data with a dip in energy that satisfies the dip no-cross region criteria.
  - 44. The speech recognition system as recited in claim 41, wherein the specified no-cross region criteria includes broad category change no-cross region criteria and identifying a set of one or more no-cross regions includes identifying a region of utterance data over which the broad phonetic classification changed sufficiently to satisfy the broad category no-cross region criteria.
  - 45. The speech recognition system as recited in claim 31, wherein the step of selecting the set of phonetic units from the set of filtered candidate phonetic units includes selecting the set of N number of phonetic units having the relatively highest probability of boundaries at the ends and relatively lowest probability of boundaries internal to the unit.
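
Claims 15, 30 and 45 describe selecting the N units with the highest probability of boundaries at their ends and the lowest probability of boundaries inside them. The claims do not give a scoring formula, so the sketch below uses an assumed one: the product of the two end-boundary probabilities, multiplied by the probability that each internal candidate boundary is not a true boundary. The function names and the probability values are hypothetical.

```python
import heapq
from math import prod
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start_frame, end_frame)


def unit_score(unit: Span, boundary_prob: Dict[int, float]) -> float:
    """Assumed score: high for likely end boundaries, low for likely internal ones."""
    start, end = unit
    ends = boundary_prob[start] * boundary_prob[end]
    # Probability that none of the internal candidate boundaries is real.
    internal = prod(1.0 - p for f, p in boundary_prob.items() if start < f < end)
    return ends * internal


def select_top_units(units: List[Span], boundary_prob: Dict[int, float], n: int) -> List[Span]:
    """Return the n highest-scoring units, best first."""
    return heapq.nlargest(n, units, key=lambda u: unit_score(u, boundary_prob))


# Hypothetical boundary probabilities, keyed by frame index.
boundary_prob = {0: 0.9, 10: 0.8, 15: 0.3, 25: 0.9}
units = [(0, 10), (0, 15), (10, 15), (10, 25), (15, 25)]

best = select_top_units(units, boundary_prob, n=2)
print(best)  # [(0, 10), (10, 25)]
```

Here (10, 25) outranks (10, 15) even though it contains the candidate boundary at frame 15, because that internal boundary has a low probability (0.3) while the end at frame 25 has a high one (0.9).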

Current Assignee: SpeechWorks International, Inc. (Microsoft Corporation)
Original Assignee: SpeechWorks International, Inc. (Microsoft Corporation)
Inventors: Phillips, Michael S.; Fanty, Mark
Primary Examiner(s): Banks-Harold, Marsha D.
Assistant Examiner(s): Abebe, Daniel Demelash
Application Number: US09/534,707
Time in Patent Office: 1,089 Days
Field of Search: 704/200, 704/206, 704/232, 704/241, 704/242, 704/243, 704/245, 704/249, 704/256, 704/254
US Class Current: 704/249
CPC Class Codes:
G10L 15/04 Segmentation; Word boundary...
G10L 2015/025 Phonemes, fenemes or fenone...
