Content-aware speaker recognition
First Claim
1. A text-independent speaker recognition system comprising:
a front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
process an audio signal comprising a current sample of natural language speech;
identify a speech segment in the current sample of natural language speech; and
create a phonetic representation of the speech segment of the current speech sample; and
a back end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
create a current speaker model based on the phonetic representation of the speech segment of the current speech sample, the current speaker model mathematically representing at least one speaker-specific phonemic characteristic of the current speech sample; and
compare the current speaker model to a stored speaker model, the stored speaker model mathematically associating phonetic content with one or more other speech samples;
wherein the front end module is to apply a neural network-based acoustic model to associate the speech segment with phonetic content;
wherein the front end module is to align the phonetic content of the speech segment with time; and
wherein the front end module is to align the phonetic content of the speech segment in lexical units, and the back end module is to compute a distance between at least one of the lexical units of the phonetic content with a similar lexical unit of the stored speaker model.
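The comparison recited in the last clause, a distance between lexical units of the current sample and similar units of the stored speaker model, can be illustrated with a minimal sketch. The claim does not name a distance metric or a model layout, so everything below is an assumption made for illustration: each model is a hypothetical dictionary from a lexical-unit label (here tri-phone-style strings) to a feature vector, the metric is Euclidean, and units absent from the stored model are skipped.

```python
import math

def unit_distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def model_distance(current, stored):
    """Average per-unit distance over lexical units present in both models.

    Units of the current sample with no counterpart in the stored model are
    skipped (an assumption of this sketch). Returns None if no unit is shared.
    """
    shared = sorted(set(current) & set(stored))
    if not shared:
        return None
    return sum(unit_distance(current[u], stored[u]) for u in shared) / len(shared)

# Hypothetical models: lexical-unit label -> feature vector.
current_model = {"k-ae+t": [1.0, 2.0], "ae-t+s": [0.0, 0.0], "s-ih+t": [5.0, 5.0]}
stored_model = {"k-ae+t": [1.0, 5.0], "ae-t+s": [3.0, 4.0]}

score = model_distance(current_model, stored_model)
print(score)
```

Only the two shared units contribute to the score here; the third unit of the current sample is ignored because the stored model has nothing to compare it with.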
Abstract
A content-aware speaker recognition system includes technologies to, among other things, analyze phonetic content of a speech sample, incorporate phonetic content of the speech sample into a speaker model, and use the phonetically-aware speaker model for speaker recognition.
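One way to picture how phonetic content might be incorporated into a speaker model is through posterior-weighted statistics, which claim 5 recites in general terms. The sketch below assumes the common zeroth-order (soft count) and first-order (weighted feature sum) form; the source does not confirm that this exact form is used. The function name, the frame features, and the posterior values are hypothetical.

```python
def posterior_weighted_stats(frames, posteriors):
    """Accumulate zeroth- and first-order statistics per phonetic class.

    frames:     list of feature vectors, one per frame
    posteriors: list of per-frame lists, one posterior probability per class
    Returns (counts, sums): counts[k] is the soft frame count for class k,
    sums[k] is the posterior-weighted sum of feature vectors for class k.
    """
    num_classes = len(posteriors[0])
    dim = len(frames[0])
    counts = [0.0] * num_classes
    sums = [[0.0] * dim for _ in range(num_classes)]
    for x, post in zip(frames, posteriors):
        for k, p in enumerate(post):
            counts[k] += p
            for d in range(dim):
                sums[k][d] += p * x[d]
    return counts, sums

# Two hypothetical frames and their posteriors over two phonetic classes.
frames = [[1.0, 2.0], [3.0, 4.0]]
posteriors = [[1.0, 0.0], [0.5, 0.5]]

counts, sums = posterior_weighted_stats(frames, posteriors)
print(counts, sums)
```

In this form each frame contributes to every phonetic class in proportion to its posterior, so the resulting statistics describe how the speaker pronounces each class rather than the sample as a whole.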
9 Claims
1. A text-independent speaker recognition system comprising:
a front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
process an audio signal comprising a current sample of natural language speech;
identify a speech segment in the current sample of natural language speech; and
create a phonetic representation of the speech segment of the current speech sample; and
a back end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
create a current speaker model based on the phonetic representation of the speech segment of the current speech sample, the current speaker model mathematically representing at least one speaker-specific phonemic characteristic of the current speech sample; and
compare the current speaker model to a stored speaker model, the stored speaker model mathematically associating phonetic content with one or more other speech samples;
wherein the front end module is to apply a neural network-based acoustic model to associate the speech segment with phonetic content;
wherein the front end module is to align the phonetic content of the speech segment with time; and
wherein the front end module is to align the phonetic content of the speech segment in lexical units, and the back end module is to compute a distance between at least one of the lexical units of the phonetic content with a similar lexical unit of the stored speaker model.

2. A text-independent speaker recognition system comprising:
a front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
process an audio signal comprising a current sample of natural language speech;
identify a speech segment in the current sample of natural language speech; and
create a phonetic representation of the speech segment of the current speech sample; and
a back end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
create a current speaker model based on the phonetic representation of the speech segment of the current speech sample, the current speaker model mathematically representing at least one speaker-specific phonemic characteristic of the current speech sample; and
compare the current speaker model to a stored speaker model, the stored speaker model mathematically associating phonetic content with one or more other speech samples;
wherein the front end module is to apply a neural network-based acoustic model to associate the speech segment with phonetic content;
wherein the front end module is to align the phonetic content of the speech segment with time; and
wherein the front end module is to align the phonetic content of the speech segment in tri-phones, and the back end module is to compute a distance between at least one of the tri-phones of the phonetic content with a similar tri-phone of the stored speaker model.

3. A text-independent speaker recognition system comprising:
a front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
process an audio signal comprising a current sample of natural language speech;
identify a speech segment in the current sample of natural language speech; and
create a phonetic representation of the speech segment of the current speech sample; and
a back end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
create a current speaker model based on the phonetic representation of the speech segment of the current speech sample, the current speaker model mathematically representing at least one speaker-specific phonemic characteristic of the current speech sample; and
compare the current speaker model to a stored speaker model, the stored speaker model mathematically associating phonetic content with one or more other speech samples;
wherein the front end module is to apply a neural network-based acoustic model to associate the speech segment with phonetic content;
wherein the front end module is to align the phonetic content of the speech segment with time;
wherein the front end module is to align the phonetic content of the speech segment in tri-phones, and the back end module is to compute a distance between at least one of the tri-phones of the phonetic content with a similar tri-phone of the stored speaker model; and
wherein the back end module is to disregard tri-phones of the speech segment that do not have similar tri-phones in the stored speaker model.

4. A front end module for a text-independent speaker recognition system, the front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device comprising a plurality of instructions embodied in one or more computer accessible storage media and executable by a processor to:
process an audio signal comprising a sample of natural language speech;
identify a plurality of temporal speech segments in the natural language speech sample;
assign a phonetic unit of a plurality of different phonetic units to each of the speech segments of the speech sample, wherein each of the phonetic units is associated with a class of phonetic content of a plurality of classes of phonetic content; and
mathematically determine speaker-specific information about the pronunciation of the speech segments in comparison to the pronunciation of the speech segments by a general population;
wherein the front end module is to apply a partial speech recognition system comprising a hidden Markov model and a deep neural network to associate different speech segments with different phonetic units; and
wherein the front end module is to determine the phonetic unit to associate with a current speech segment based on a previously-determined association of a phonetic unit with another speech segment that is temporally adjacent the current speech segment in the natural language speech sample.

5. A front end module for a text-independent speaker recognition system, the front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device comprising a plurality of instructions embodied in one or more computer accessible storage media and executable by a processor to:
process an audio signal comprising a sample of natural language speech;
identify a plurality of temporal speech segments in the natural language speech sample;
assign a phonetic unit of a plurality of different phonetic units to each of the speech segments of the speech sample, wherein each of the phonetic units is associated with a class of phonetic content of a plurality of classes of phonetic content; and
mathematically determine speaker-specific information about the pronunciation of the speech segments in comparison to the pronunciation of the speech segments by a general population;
wherein the front end module is to compute a plurality of statistics to determine the speaker-specific information, and the front end module is to compute the plurality of statistics using posterior probabilities of the phonetic classes.

6. A method for text-independent speaker recognition, the method comprising, with code embodied in one or more non-transitory computer readable media and executable by at least one computing device:
processing an audio signal comprising a current sample of natural language speech and speaker-specific information about the speaker of the current speech sample;
executing a speech recognizer on the current speech sample to:
identify a speech segment in the current speech sample; and
create a phonetic representation of the speech segment, the phonetic representation mathematically associating phonetic content with the speech segment;
executing a speaker recognizer to create a current speaker model comprising the speech segment, the phonetic representation of the speech segment, and the speaker-specific information; and
mathematically computing a distance between the phonetic representation of the speech segment of the current speaker model and a similar phonetic representation of the stored speaker model.

7. A method for text-independent speaker recognition, the method comprising, with code embodied in one or more non-transitory computer readable media and executable by at least one computing device:
processing an audio signal comprising a current sample of natural language speech and speaker-specific information about the speaker of the current speech sample;
executing a speech recognizer on the current speech sample to:
identify a speech segment in the current speech sample; and
create a phonetic representation of the speech segment, the phonetic representation mathematically associating phonetic content with the speech segment; and
executing a speaker recognizer to create a current speaker model comprising the speech segment, the phonetic representation of the speech segment, and the speaker-specific information;
wherein the phonetic representation of the speech segment comprises a phonetic unit assigned to a phonetic class, and the method comprises selecting a stored speaker model having a phonetic unit in the same phonetic class as the phonetic unit associated with the speech segment of the current speaker model.

8. A method for text-independent speaker recognition, the method comprising, with code embodied in one or more non-transitory computer readable media and executable by at least one computing device:
processing an audio signal comprising a current sample of natural language speech and speaker-specific information about the speaker of the current speech sample;
executing a speech recognizer on the current speech sample to:
identify a speech segment in the current speech sample; and
create a phonetic representation of the speech segment, the phonetic representation mathematically associating phonetic content with the speech segment;
executing a speaker recognizer to create a current speaker model comprising the speech segment, the phonetic representation of the speech segment, and the speaker-specific information; and
comprising aligning the phonetic representation of the speech segment in tri-phones, analyzing the similarity of at least one of the tri-phones with a similar tri-phone of the stored speaker model, and disregarding tri-phones of the speech segment that do not have similar tri-phones in the stored speaker model.

9. A method for text-independent speaker recognition, the method comprising, with code embodied in one or more non-transitory computer readable media and executable by at least one computing device:
processing an audio signal comprising a current sample of natural language speech and speaker-specific information about the speaker of the current speech sample;
executing a speech recognizer on the current speech sample to:
identify a speech segment in the current speech sample; and
create a phonetic representation of the speech segment, the phonetic representation mathematically associating phonetic content with the speech segment;
executing a speaker recognizer to create a current speaker model comprising the speech segment, the phonetic representation of the speech segment, and the speaker-specific information; and
comprising determining a phonetic representation to associate with the speech segment based on a previously-determined association of another phonetic representation with another speech segment that is temporally adjacent the speech segment in the current speech sample.

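The context-dependent assignment recited in claims 4 and 9, where the phonetic unit chosen for a segment depends on the unit already associated with a temporally adjacent segment, can be sketched as a simple left-to-right pass. This is an illustration under stated assumptions, not the claimed decoder: the per-segment acoustic scores, the transition table, and the function name are hypothetical, and a full hidden Markov model recognizer would typically search over whole sequences (for example with Viterbi decoding) rather than commit greedily.

```python
def assign_units(acoustic_scores, transition, units):
    """Pick one phonetic unit per segment, left to right.

    acoustic_scores: list of dicts, unit -> score for that segment (assumed given)
    transition:      dict of dicts, transition[prev][unit] -> weight (assumed given)
    The first segment uses its acoustic score alone; each later segment weights
    its acoustic score by the transition from the previously chosen unit.
    """
    assigned = []
    prev = None
    for scores in acoustic_scores:
        best_unit, best_value = None, float("-inf")
        for u in units:
            value = scores[u]
            if prev is not None:
                value *= transition[prev][u]
            if value > best_value:
                best_unit, best_value = u, value
        assigned.append(best_unit)
        prev = best_unit
    return assigned

# Hypothetical two-unit inventory with a strong preference for staying put.
units = ["s", "z"]
transition = {"s": {"s": 0.9, "z": 0.1}, "z": {"s": 0.1, "z": 0.9}}
acoustic_scores = [{"s": 0.8, "z": 0.2}, {"s": 0.4, "z": 0.6}]

labels = assign_units(acoustic_scores, transition, units)
print(labels)
```

Taken alone, the second segment scores higher for "z", but the association already made for the first segment tips the decision to "s", which is the kind of dependence on an adjacent segment that the claims describe.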
Specification