Content-aware speaker recognition

US 9,336,781 B2
Filed: 04/29/2014
Issued: 05/10/2016
Est. Priority Date: 10/17/2013
Status: Active Grant

Abstract

A content-aware speaker recognition system includes technologies to, among other things, analyze phonetic content of a speech sample, incorporate phonetic content of the speech sample into a speaker model, and use the phonetically-aware speaker model for speaker recognition.

28 Citations

9 Claims

1. A text-independent speaker recognition system comprising:
- a front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
  
  process an audio signal comprising a current sample of natural language speech;
  
  identify a speech segment in the current sample of natural language speech; and
  
  create a phonetic representation of the speech segment of the current speech sample; and
  
  a back end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
  
  create a current speaker model based on the phonetic representation of the speech segment of the current speech sample, the current speaker model mathematically representing at least one speaker-specific phonemic characteristic of the current speech sample; and
  
  compare the current speaker model to a stored speaker model, the stored speaker model mathematically associating phonetic content with one or more other speech samples;
  
  wherein the front end module is to apply a neural network-based acoustic model to associate the speech segment with phonetic content;
  
  wherein the front end module is to align the phonetic content of the speech segment with time; and
  
  wherein the front end module is to align the phonetic content of the speech segment in lexical units, and the back end module is to compute a distance between at least one of the lexical units of the phonetic content with a similar lexical unit of the stored speaker model.
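
The time-alignment step in claim 1 can be pictured with a minimal sketch. It assumes the acoustic model has already emitted one phone label per frame and that frames are spaced 10 ms apart; the function name `align_frames`, the frame shift, and the toy labels are illustrative assumptions, not details taken from the patent.

```python
from itertools import groupby

FRAME_SHIFT_S = 0.01  # assumed 10 ms frame shift, not stated in the claim


def align_frames(frame_labels, frame_shift=FRAME_SHIFT_S):
    """Collapse per-frame phone labels into (phone, start_s, end_s) segments."""
    segments = []
    index = 0
    for phone, run in groupby(frame_labels):
        length = len(list(run))
        start = index * frame_shift
        end = (index + length) * frame_shift
        # Round to suppress floating-point noise in the reported times.
        segments.append((phone, round(start, 4), round(end, 4)))
        index += length
    return segments


# Toy per-frame output standing in for the acoustic model's decisions.
labels = ["sil", "sil", "k", "k", "k", "ae", "ae", "t"]
segments = align_frames(labels)
```

Each resulting tuple ties a unit of phonetic content to a time span, which is the kind of alignment a back end could then compare unit by unit against a stored model.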

2. A text-independent speaker recognition system comprising:
- a front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
  
  process an audio signal comprising a current sample of natural language speech;
  
  identify a speech segment in the current sample of natural language speech; and
  
  create a phonetic representation of the speech segment of the current speech sample; and
  
  a back end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
  
  create a current speaker model based on the phonetic representation of the speech segment of the current speech sample, the current speaker model mathematically representing at least one speaker-specific phonemic characteristic of the current speech sample; and
  
  compare the current speaker model to a stored speaker model, the stored speaker model mathematically associating phonetic content with one or more other speech samples;
  
  wherein the front end module is to apply a neural network-based acoustic model to associate the speech segment with phonetic content;
  
  wherein the front end module is to align the phonetic content of the speech segment with time; and
  
  wherein the front end module is to align the phonetic content of the speech segment in tri-phones, and the back end module is to compute a distance between at least one of the tri-phones of the phonetic content with a similar tri-phone of the stored speaker model.

3. A text-independent speaker recognition system comprising:
- a front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
  
  process an audio signal comprising a current sample of natural language speech;
  
  identify a speech segment in the current sample of natural language speech; and
  
  create a phonetic representation of the speech segment of the current speech sample; and
  
  a back end module embodied in one or more non-transitory computer readable media and executable by at least one computer device to:
  
  create a current speaker model based on the phonetic representation of the speech segment of the current speech sample, the current speaker model mathematically representing at least one speaker-specific phonemic characteristic of the current speech sample; and
  
  compare the current speaker model to a stored speaker model, the stored speaker model mathematically associating phonetic content with one or more other speech samples;
  
  wherein the front end module is to apply a neural network-based acoustic model to associate the speech segment with phonetic content;
  
  wherein the front end module is to align the phonetic content of the speech segment with time;
  
  wherein the front end module is to align the phonetic content of the speech segment in tri-phones, and the back end module is to compute a distance between at least one of the tri-phones of the phonetic content with a similar tri-phone of the stored speaker model; and
  
  wherein the back end module is to disregard tri-phones of the speech segment that do not have similar tri-phones in the stored speaker model.
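
The tri-phone comparison in claim 3 can be illustrated with a small sketch. It assumes each speaker model is a mapping from a tri-phone label to a non-zero feature vector and uses cosine distance averaged over the shared tri-phones; the claim does not name a distance measure, so the metric, the function names, and the `left-center+right` label notation are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def triphone_distance(current, stored):
    """Average distance over tri-phones present in both models.

    Tri-phones of the current sample with no counterpart in the stored
    model are disregarded. Returns None when nothing is shared.
    """
    shared = sorted(set(current) & set(stored))
    if not shared:
        return None
    total = sum(cosine_distance(current[t], stored[t]) for t in shared)
    return total / len(shared)


# Hypothetical toy models: tri-phone label -> feature vector.
current = {"k-ae+t": [1.0, 0.0], "ae-t+s": [0.0, 2.0], "s-ih+t": [3.0, 1.0]}
stored = {"k-ae+t": [2.0, 0.0], "ae-t+s": [1.0, 0.0]}
score = triphone_distance(current, stored)  # "s-ih+t" is ignored
```

Restricting the comparison to shared tri-phones keeps the score from being affected by phonetic content that the stored model never observed.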

4. A front end module for a text-independent speaker recognition system, the front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device comprising a plurality of instructions embodied in one or more computer accessible storage media and executable by a processor to:
- process an audio signal comprising a sample of natural language speech;
  
  identify a plurality of temporal speech segments in the natural language speech sample;
  
  assign a phonetic unit of a plurality of different phonetic units to each of the speech segments of the speech sample, wherein each of the phonetic units is associated with a class of phonetic content of a plurality of classes of phonetic content; and
  
  mathematically determine speaker-specific information about the pronunciation of the speech segments in comparison to the pronunciation of the speech segments by a general population;
  
  wherein the front end module is to apply a partial speech recognition system comprising a hidden Markov model and a deep neural network to associate different speech segments with different phonetic units; and
  
  wherein the front end module is to determine the phonetic unit to associate with a current speech segment based on a previously-determined association of a phonetic unit with another speech segment that is temporally adjacent the current speech segment in the natural language speech sample.
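
One common way to let the label of a segment depend on the label of its temporal neighbour is Viterbi decoding over a hidden Markov model whose per-frame scores come from a neural network. The sketch below is an illustration under that assumption, not the decoder described in the patent. For simplicity it uses the network posteriors directly as emission scores (hybrid systems typically divide by class priors first), and the units `"a"` and `"b"` and all probabilities are made up. All probabilities must be non-zero.

```python
import math


def viterbi(posteriors, transitions, initial):
    """Most likely unit sequence given per-frame scores and transitions.

    posteriors:  list of dicts, unit -> P(unit | frame) (stand-in for DNN output)
    transitions: dict, (prev_unit, cur_unit) -> P(cur_unit | prev_unit)
    initial:     dict, unit -> P(unit at first frame)
    """
    units = list(initial)
    score = {u: math.log(initial[u]) + math.log(posteriors[0][u]) for u in units}
    back = []
    for frame in posteriors[1:]:
        new_score, pointers = {}, {}
        for cur in units:
            # Best predecessor for `cur`, given the previous frame's scores.
            prev = max(units, key=lambda p: score[p] + math.log(transitions[(p, cur)]))
            new_score[cur] = (
                score[prev] + math.log(transitions[(prev, cur)]) + math.log(frame[cur])
            )
            pointers[cur] = prev
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final unit.
    path = [max(units, key=lambda u: score[u])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


initial = {"a": 0.6, "b": 0.4}
transitions = {("a", "a"): 0.9, ("a", "b"): 0.1, ("b", "a"): 0.1, ("b", "b"): 0.9}
posteriors = [{"a": 0.9, "b": 0.1}, {"a": 0.4, "b": 0.6}, {"a": 0.9, "b": 0.1}]
path = viterbi(posteriors, transitions, initial)
```

In this toy input the middle frame slightly favours `"b"` on its own, but the strong preference for staying in the same unit keeps the decoded sequence at `"a"` throughout, which shows how adjacent associations influence the current one.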

5. A front end module for a text-independent speaker recognition system, the front end module embodied in one or more non-transitory computer readable media and executable by at least one computer device comprising a plurality of instructions embodied in one or more computer accessible storage media and executable by a processor to:
- process an audio signal comprising a sample of natural language speech;
  
  identify a plurality of temporal speech segments in the natural language speech sample;
  
  assign a phonetic unit of a plurality of different phonetic units to each of the speech segments of the speech sample, wherein each of the phonetic units is associated with a class of phonetic content of a plurality of classes of phonetic content; and
  
  mathematically determine speaker-specific information about the pronunciation of the speech segments in comparison to the pronunciation of the speech segments by a general population;
  
  wherein the front end module is to compute a plurality of statistics to determine the speaker-specific information, and the front end module is to compute the plurality of statistics using posterior probabilities of the phonetic classes.
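
Statistics computed from class posteriors are often the zeroth-order and first-order sufficient statistics used in speaker modelling: for each phonetic class, the summed posterior mass and the posterior-weighted sum of feature vectors. The sketch below assumes that reading of claim 5; the function name and the toy inputs are hypothetical, and the claim itself does not say which statistics are computed.

```python
def sufficient_statistics(features, posteriors):
    """Accumulate per-class statistics from frame-level class posteriors.

    features:   list of frames, each a list of floats (dimension D)
    posteriors: list of frames, each a list of K class posteriors
    Returns (zeroth, first): zeroth[k] = sum_t gamma_t(k),
                             first[k]  = sum_t gamma_t(k) * x_t
    """
    num_classes = len(posteriors[0])
    dim = len(features[0])
    zeroth = [0.0] * num_classes
    first = [[0.0] * dim for _ in range(num_classes)]
    for x, gamma in zip(features, posteriors):
        for k, g in enumerate(gamma):
            zeroth[k] += g
            for d, value in enumerate(x):
                first[k][d] += g * value
    return zeroth, first


# Two frames of 2-D features and posteriors over two phonetic classes.
features = [[1.0, 2.0], [3.0, 4.0]]
posteriors = [[1.0, 0.0], [0.5, 0.5]]
zeroth, first = sufficient_statistics(features, posteriors)
```

Dividing `first[k]` by `zeroth[k]` gives the speaker's mean feature vector for class `k`, which could then be compared with a population-level mean for the same class.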

6. A method for text-independent speaker recognition, the method comprising, with code embodied in one or more non-transitory computer readable media and executable by at least one computing device:
- processing an audio signal comprising a current sample of natural language speech and speaker-specific information about the speaker of the current speech sample;
  
  executing a speech recognizer on the current speech sample to:
  
  identify a speech segment in the current speech sample; and
  
  create a phonetic representation of the speech segment, the phonetic representation mathematically associating phonetic content with the speech segment;
  
  executing a speaker recognizer to create a current speaker model comprising the speech segment, the phonetic representation of the speech segment, and the speaker-specific information; and
  
  mathematically computing a distance between the phonetic representation of the speech segment of the current speaker model and a similar phonetic representation of the stored speaker model.

7. A method for text-independent speaker recognition, the method comprising, with code embodied in one or more non-transitory computer readable media and executable by at least one computing device:
- processing an audio signal comprising a current sample of natural language speech and speaker-specific information about the speaker of the current speech sample;
  
  executing a speech recognizer on the current speech sample to:
  
  identify a speech segment in the current speech sample; and
  
  create a phonetic representation of the speech segment, the phonetic representation mathematically associating phonetic content with the speech segment; and
  
  executing a speaker recognizer to create a current speaker model comprising the speech segment, the phonetic representation of the speech segment, and the speaker-specific information;
  
  wherein the phonetic representation of the speech segment comprises a phonetic unit assigned to a phonetic class, and the method comprises selecting a stored speaker model having a phonetic unit in the same phonetic class as the phonetic unit associated with the speech segment of the current speaker model.
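
The selection step in claim 7 can be sketched as a lookup by phonetic class. The class table, the speaker names, and the representation of a stored model as a list of phonetic units are all hypothetical and serve only to show the filtering logic.

```python
# Hypothetical mapping from phonetic unit to phonetic class.
PHONETIC_CLASS = {
    "p": "plosive",
    "t": "plosive",
    "s": "fricative",
    "aa": "vowel",
    "iy": "vowel",
}


def select_models(current_unit, stored_models, classes=PHONETIC_CLASS):
    """Return names of stored models containing a unit in the same class."""
    target = classes[current_unit]
    return [
        name
        for name, units in stored_models.items()
        if any(classes.get(unit) == target for unit in units)
    ]


# Hypothetical stored models: speaker name -> phonetic units observed.
stored_models = {"alice": ["t", "iy"], "bob": ["s"], "carol": ["aa", "p"]}
candidates = select_models("p", stored_models)
```

Here the current unit `"p"` is a plosive, so only the stored models that contain some plosive are kept as candidates for comparison.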

8. A method for text-independent speaker recognition, the method comprising, with code embodied in one or more non-transitory computer readable media and executable by at least one computing device:
- processing an audio signal comprising a current sample of natural language speech and speaker-specific information about the speaker of the current speech sample;
  
  executing a speech recognizer on the current speech sample to:
  
  identify a speech segment in the current speech sample; and
  
  create a phonetic representation of the speech segment, the phonetic representation mathematically associating phonetic content with the speech segment;
  
  executing a speaker recognizer to create a current speaker model comprising the speech segment, the phonetic representation of the speech segment, and the speaker-specific information; and
  
  comprising aligning the phonetic representation of the speech segment in tri-phones, analyzing the similarity of at least one of the tri-phones with a similar tri-phone of the stored speaker model, and disregarding tri-phones of the speech segment that do not have similar tri-phones in the stored speaker model.

9. A method for text-independent speaker recognition, the method comprising, with code embodied in one or more non-transitory computer readable media and executable by at least one computing device:
- processing an audio signal comprising a current sample of natural language speech and speaker-specific information about the speaker of the current speech sample;
  
  executing a speech recognizer on the current speech sample to:
  
  identify a speech segment in the current speech sample; and
  
  create a phonetic representation of the speech segment, the phonetic representation mathematically associating phonetic content with the speech segment;
  
  executing a speaker recognizer to create a current speaker model comprising the speech segment, the phonetic representation of the speech segment, and the speaker-specific information; and
  
  comprising determining a phonetic representation to associate with the speech segment based on a previously-determined association of another phonetic representation with another speech segment that is temporally adjacent the speech segment in the current speech sample.

Current Assignee: SRI International, Inc.
Original Assignee: SRI International, Inc.
Inventors: Scheffer, Nicolas; Lei, Yun
Primary Examiner(s): Singh, Satwant
Application Number: US14/264,916
Publication Number: US 20150112684A1
Time in Patent Office: 742 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes: G10L 17/14 Use of phonemic categorisat...