Speech representation by feature-based word prototypes comprising phoneme targets having reliable high similarity
First Claim
1. A method of electronically representing a given spoken word or phrase as a digital prototype comprising:
providing phoneme templates representing a database of standard speech;
for a first training instance of said given word or phrase, comparing the first training instance with said phoneme templates to produce first phoneme similarity data as a function of time;
processing said first phoneme similarity data to extract first training instance features that exceed a predetermined similarity threshold;
for a second training instance of said given word or phrase, comparing the second training instance with said phoneme templates to produce second phoneme similarity data as a function of time;
processing said second phoneme similarity data to extract second training instance features that exceed a predetermined similarity threshold;
aligning the extracted first and second training instance features and selecting those features that achieve a predetermined degree of correlation between first and second training instances to produce time-dependent speaker-independent phoneme similarity data;
building word prototype targets corresponding to the time-dependent speaker-independent phoneme similarity data, the word prototype targets each including a phoneme symbol and at least one datum indicative of a phoneme similarity score;
using said phoneme symbol and said phoneme similarity score as a digital prototype to electronically represent the given speech utterance.
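The feature extraction step recited above (keeping only the portions of the phoneme similarity data that exceed a predetermined threshold) can be illustrated with a minimal sketch. It assumes the similarity data for one phoneme is a plain list of per-frame scores; the names `Region` and `extract_regions`, the score scale, and the example threshold are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """One high-similarity region for a single phoneme (hypothetical layout)."""
    phoneme: str
    left: int           # first frame whose score exceeds the threshold
    right: int          # last frame whose score exceeds the threshold
    peak_frame: int     # frame with the highest score inside the region
    peak_height: float  # score at that frame


def extract_regions(phoneme: str, scores: List[float], threshold: float) -> List[Region]:
    """Return every contiguous run of frames whose score exceeds `threshold`."""
    regions: List[Region] = []
    start = None
    # A trailing sentinel below any threshold closes a run that reaches the last frame.
    for t, s in enumerate(scores + [float("-inf")]):
        if s > threshold and start is None:
            start = t
        elif s <= threshold and start is not None:
            run = scores[start:t]
            offset = max(range(len(run)), key=run.__getitem__)
            regions.append(Region(phoneme, start, t - 1, start + offset, run[offset]))
            start = None
    return regions


# Assumed example data: seven frames of similarity scores for the phoneme "s".
example = extract_regions("s", [0.1, 0.6, 0.9, 0.7, 0.2, 0.8, 0.3], threshold=0.5)
```

Under these assumptions the example yields two regions, one spanning frames 1 to 3 and one consisting of frame 5 alone. Frames at or below the threshold are discarded, which is what lets the later steps work on a short list of features rather than on every frame.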
Abstract
Digitized speech utterances are converted into phoneme similarity data, and regions of high similarity are then extracted and used in forming the word prototype. Alignment across speakers eliminates unreliable high phoneme similarity regions. Word prototype targets are then constructed comprising the following parameters: the phoneme symbol, the average peak height of the phoneme similarity score, the average peak location, and the left and right frame locations. Each target is assigned a statistical weight representing the percentage of occurrences of that high similarity region across all speakers. The word prototype is feature-based, allowing a robust speech representation to be constructed without the need for frame-by-frame analysis.
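The target parameters and the statistical weight described in the abstract can be sketched as follows. The abstract does not say how regions are aligned across speakers, so this sketch substitutes a simple assumption: regions with the same phoneme symbol whose peak frames lie within a fixed tolerance are treated as the same feature. The names `Region`, `Target`, `build_targets`, `tolerance`, and `min_weight` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Region:
    """High-similarity region found in one training instance (hypothetical layout)."""
    phoneme: str
    left: int
    right: int
    peak_frame: int
    peak_height: float


@dataclass
class Target:
    """Word prototype target holding the parameters listed in the abstract."""
    phoneme: str
    avg_peak_height: float
    avg_peak_frame: float
    left: float
    right: float
    weight: float  # fraction of training instances in which the region occurred


def build_targets(instances: List[List[Region]], tolerance: int = 5,
                  min_weight: float = 0.5) -> List[Target]:
    """Group matching regions across instances, then drop unreliable groups."""
    clusters: List[List[Region]] = []
    for regions in instances:
        used = set()  # each cluster takes at most one region per instance
        for r in regions:
            for i, c in enumerate(clusters):
                close = abs(mean(x.peak_frame for x in c) - r.peak_frame) <= tolerance
                if i not in used and c[0].phoneme == r.phoneme and close:
                    c.append(r)
                    used.add(i)
                    break
            else:
                clusters.append([r])
                used.add(len(clusters) - 1)
    targets = []
    for c in clusters:
        weight = len(c) / len(instances)
        if weight >= min_weight:  # assumed cutoff for "reliable" regions
            targets.append(Target(c[0].phoneme,
                                  mean(x.peak_height for x in c),
                                  mean(x.peak_frame for x in c),
                                  mean(x.left for x in c),
                                  mean(x.right for x in c),
                                  weight))
    return sorted(targets, key=lambda t: t.avg_peak_frame)


# Assumed example: the "k" region appears for only one of two speakers.
speaker_a = [Region("s", 2, 6, 4, 80.0), Region("ih", 10, 14, 12, 70.0),
             Region("k", 30, 32, 31, 60.0)]
speaker_b = [Region("s", 4, 8, 6, 90.0), Region("ih", 11, 15, 13, 60.0)]
prototype = build_targets([speaker_a, speaker_b], min_weight=1.0)
```

With `min_weight=1.0` only regions present for both speakers survive, so the "k" region is eliminated and the prototype holds two targets. A lower cutoff would keep less consistent regions and leave their lower weight to discount them.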
45 Claims
1. A method of electronically representing a given spoken word or phrase as a digital prototype comprising:
providing phoneme templates representing a database of standard speech;
for a first training instance of said given word or phrase, comparing the first training instance with said phoneme templates to produce first phoneme similarity data as a function of time;
processing said first phoneme similarity data to extract first training instance features that exceed a predetermined similarity threshold;
for a second training instance of said given word or phrase, comparing the second training instance with said phoneme templates to produce second phoneme similarity data as a function of time;
processing said second phoneme similarity data to extract second training instance features that exceed a predetermined similarity threshold;
aligning the extracted first and second training instance features and selecting those features that achieve a predetermined degree of correlation between first and second training instances to produce time-dependent speaker-independent phoneme similarity data;
building word prototype targets corresponding to the time-dependent speaker-independent phoneme similarity data, the word prototype targets each including a phoneme symbol and at least one datum indicative of a phoneme similarity score;
using said phoneme symbol and said phoneme similarity score as a digital prototype to electronically represent the given speech utterance.
Dependent claims: 2-21.
22. An apparatus for electronically representing a given speech utterance signal as a digital word prototype comprising:
a phoneme template for representing a database of calibration speech;
means for comparing the utterance signal of said first speaker with said phoneme template to produce first speaker phoneme similarity data as a function of time for said given speech utterance signal of a first speaker;
means for processing said first speaker phoneme similarity data to extract first speaker features that exceed a predetermined similarity threshold;
means for comparing the utterance signal of said second speaker with said phoneme template to produce second speaker phoneme similarity data as a function of time for said given speech utterance signal of a second speaker;
means for processing said second speaker phoneme similarity data to extract second speaker features that exceed a predetermined similarity threshold;
means for aligning the extracted first and second speaker features and selecting those features that achieve a predetermined degree of correlation between first and second speakers to produce time-dependent speaker-independent phoneme similarity data;
means for building word prototype targets corresponding to the time-dependent speaker-independent phoneme similarity data, the word prototype targets each including a phoneme symbol and at least one feature location datum indicative of a time location of that feature;
means for using said word prototype targets as said digital word prototype to electronically represent the given speech utterance.
23. A method of electronically representing a given spoken word or phrase as a digital prototype, comprising:
providing phoneme templates representing a database of standard speech;
providing a plurality of training instances of said given spoken word or phrase;
for each training instance, comparing the training instance with said phoneme templates to produce training instance phoneme similarity data as a function of time;
building a digital prototype corresponding to the time-dependent phoneme similarity data, said prototype consisting of at least a list of high phoneme similarity region targets, each target having a phoneme identifier and feature data including at least one time location datum indicative of a time location of that phoneme.
Dependent claims: 24-43.
44. An apparatus for electronically representing a given spoken word or phrase as a digital prototype, comprising:
a phoneme template for representing a database of standard speech;
means for receiving a plurality of training instances of a given spoken word or phrase;
means for comparing each training instance of the spoken word or phrase with said phoneme template to produce training instance phoneme similarity data as a function of time for said training instance of the given word or phrase;
means for processing said training instance phoneme similarity data to extract regions that exceed a predetermined phoneme similarity threshold;
means for producing word prototype targets of said given word or phrase by incrementally merging said training instance phoneme similarity data;
means for building target congruence prototypes corresponding to the time-dependent phoneme similarity data, the prototype targets comprising at least a list of phoneme targets, each including a phoneme identifier and feature data including at least one time location datum indicative of a time location of that phoneme;
means for using said target congruence prototype as said digital prototype for said given spoken word or phrase.
Dependent claims: 45.
Specification