Fast, language-independent method for user authentication by voice

US 9,218,809 B2
Filed: 01/09/2014
Issued: 12/22/2015
Est. Priority Date: 03/16/2000
Status: Expired due to Fees

Abstract

A method and system for training a user authentication by voice signal are described. In one embodiment, a set of feature vectors are decomposed into speaker-specific recognition units. The speaker-specific recognition units are used to compute distribution values to train the voice signal. In addition, spectral feature vectors are decomposed into speaker-specific characteristic units which are compared to the speaker-specific distribution values. If the speaker-specific characteristic units are within a threshold limit of the speaker-specific distribution values, the speech signal is authenticated.
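
The training and threshold test described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: it assumes each recognition unit is already available as a fixed-length vector of floats, that the "distribution values" are a per-component mean and standard deviation, and that the threshold is expressed in standard deviations. The names `train_distribution`, `is_authenticated`, `max_sigma`, and `floor` are hypothetical.

```python
import statistics

def train_distribution(units):
    """Per-component mean and standard deviation over enrollment units.

    `units` is a list of equal-length vectors, one per training utterance.
    (Assumption: mean/stdev stand in for the "distribution values".)
    """
    components = list(zip(*units))  # regroup values by vector position
    means = [statistics.fmean(c) for c in components]
    stdevs = [statistics.pstdev(c) for c in components]
    return means, stdevs

def is_authenticated(unit, means, stdevs, max_sigma=2.0, floor=1e-9):
    """Accept when every component lies within max_sigma deviations of its mean.

    `floor` keeps a tiny tolerance when a component never varied in training.
    """
    return all(
        abs(x - m) <= max_sigma * max(s, floor)
        for x, m, s in zip(unit, means, stdevs)
    )

# Three enrollment vectors for one (hypothetical) registered speaker.
enrolled = [[1.0, 4.0], [1.2, 4.4], [0.8, 3.6]]
means, stdevs = train_distribution(enrolled)

print(is_authenticated([1.1, 4.1], means, stdevs))  # close to the enrolled data
print(is_authenticated([3.0, 9.0], means, stdevs))  # far from the enrolled data
```

A per-component test is only one way to read "within a threshold limit"; a single distance over the whole vector would fit the wording equally well.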

217 Citations

25 Claims

1. A method of speech-based user authentication, comprising:
- at a device having one or more processors and memory:
  
  receiving a plurality of different spoken utterances from a user;
  
  for each of the plurality of different spoken utterances:
  
  generating a respective phoneme-independent representation from the spoken utterance, the respective phoneme-independent representation including a respective spectral signature for each of a plurality of frames sampled from the spoken utterance; and
  
  decomposing the respective phoneme-independent representation to obtain a respective content-independent recognition unit for the user;
  
  calculating a content-independent recognition distribution value for the user based on the respective content-independent recognition units generated from the plurality of different spoken utterances; and
  
  providing the content-independent recognition distribution value for use in a speaker authentication process.
  - 2. The method of claim 1, wherein decomposing the respective phoneme-independent representation for each of the plurality of different spoken utterances further comprises:
    - applying a singular value decomposition to the respective phoneme-independent representation.
  - 3. The method of claim 1 further comprising:
    - generating the respective content-independent recognition unit from a singular value matrix of a singular value decomposition of the respective phoneme-independent representation for each of the plurality of different spoken utterances.
  - 4. The method of claim 1, wherein the speaker authentication process comprises:
    - decomposing at least one spectral signature of an input speech signal into at least one content-independent characteristic unit;
      
      comparing the at least one content-independent characteristic unit to the at least one content-independent recognition distribution value; and
      
      authenticating the input speech signal if the at least one content-independent characteristic unit is within a threshold limit of the at least one content-independent recognition distribution value.
  - 5. The method of claim 4, wherein decomposing the at least one spectral signature of the input speech signal into the at least one content-independent characteristic unit further comprises:
    - applying a singular value decomposition to the at least one spectral signature of the input speech signal.
  - 6. The method of claim 4, wherein:
    - for each of the plurality of different spoken utterances, the respective phoneme-independent representation is decomposed to further obtain a respective content reference sequence;
      
      the at least one spectral signature of the input speech signal is further decomposed into at least one content input sequence; and
      
      authenticating the input speech signal further comprises authenticating the input speech signal if the at least one content input sequence is similar to at least one of the respective content reference sequences.
  - 7. The method of claim 6, further comprising:
    - determining similarity based on a distance calculated between the at least one content input sequence and the at least one of the respective content reference sequences.
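
Claims 2 and 3 tie the decomposition to a singular value decomposition. The following is a rough sketch under stated assumptions: each utterance is a frames-by-bands matrix of spectral values, and the content-independent recognition unit is taken to be the vector of singular values. The claims only say the unit is generated from the singular value matrix, so that choice is an assumption. The pure-Python one-sided Jacobi routine stands in for a library SVD, and the names `singular_values` and `recognition_unit` are hypothetical.

```python
import math

def singular_values(matrix, sweeps=30, eps=1e-12):
    """Singular values of a row-major matrix via one-sided Jacobi rotations.

    Column pairs are rotated until they are mutually orthogonal; the
    column norms are then the singular values.
    """
    m, n = len(matrix), len(matrix[0])
    cols = [[matrix[i][j] for i in range(m)] for j in range(n)]
    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                if abs(gamma) <= eps * math.sqrt(alpha * beta):
                    continue  # columns already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    x, y = cols[p][i], cols[q][i]
                    cols[p][i] = c * x - s * y
                    cols[q][i] = s * x + c * y
        if not rotated:
            break
    norms = (math.sqrt(sum(x * x for x in col)) for col in cols)
    return sorted(norms, reverse=True)

def recognition_unit(frames, size):
    """Leading `size` singular values of one utterance's frame matrix (assumed form)."""
    return singular_values(frames)[:size]

# Hypothetical utterance: three frames, two spectral bands per frame.
utterance = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
print(recognition_unit(utterance, 2))
```

Because the routine returns one value per spectral band, utterances with different numbers of frames still yield vectors of the same length, which is what a per-speaker distribution over several utterances needs.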

8. A method of authenticating a speech signal comprising:
- at a device having one or more processors and memory:
  
  receiving a spoken utterance of an unauthenticated speaker;
  
  generating a first phoneme-independent representation based on the spoken utterance;
  
  decomposing the first phoneme-independent representation into at least one content-independent characteristic unit;
  
  comparing the at least one content-independent characteristic unit to at least one content-independent recognition distribution value, the at least one content-independent recognition distribution value previously trained by a registered speaker and generated by decomposing a second phoneme-independent representation that is associated with the registered speaker to obtain at least one content-independent recognition unit; and
  
  authenticating the spoken utterance if the at least one content-independent characteristic unit is within a threshold limit of the at least one content-independent recognition distribution value.
  - 9. The method of claim 8, further comprising:
    - generating the content-independent characteristic unit from a singular value matrix of a singular value decomposition of the first phoneme-independent representation.
  - 10. The method of claim 8, further comprising:
    - computing the at least one content-independent recognition distribution value from the at least one content-independent recognition unit.
  - 11. The method of claim 10, further comprising:
    - generating the at least one content-independent recognition unit from a singular value matrix of a singular value decomposition of the second plurality of phoneme-independent representation.
  - 12. The method of claim 10, wherein decomposing the first phoneme-independent representation further comprises:
    - applying a singular value decomposition to the first phoneme-independent representation.
  - 13. The method of claim 8, wherein decomposing the first phoneme-independent representation further comprises:
    - applying a singular value decomposition to the first phoneme-independent representation.
  - 14. The method of claim 8, wherein the first phoneme-independent representation is further decomposed into at least one content input sequence, and wherein authenticating the spoken utterance further comprises authenticating the spoken utterance if the at least one content input sequence is similar to at least one content reference sequence previously trained by the registered speaker.
  - 15. The method of claim 14, further comprising:
    - determining similarity based on a distance calculated between the at least one content input sequence and the at least one content reference sequence.
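
Claims 7 and 15 require only "a distance" between the content input sequence and a content reference sequence. One common choice for sequences of differing length is dynamic time warping; using it here is an assumption, not something the claims specify. In this sketch each sequence is a list of equal-dimension vectors, and `dtw_distance`, `content_matches`, and `max_distance` are hypothetical names.

```python
import math

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of vectors."""
    rows, cols = len(seq_a), len(seq_b)
    # cost[i][j] = cheapest alignment of the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[math.inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean distance
            cost[i][j] = step + min(
                cost[i - 1][j],      # seq_a advances alone
                cost[i][j - 1],      # seq_b advances alone
                cost[i - 1][j - 1],  # both advance
            )
    return cost[rows][cols]

def content_matches(input_seq, reference_seqs, max_distance):
    """True when the input is close enough to at least one reference sequence."""
    return any(dtw_distance(input_seq, ref) <= max_distance for ref in reference_seqs)

# Hypothetical data: the spoken input lingers on the middle vector.
reference = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
spoken = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

print(dtw_distance(spoken, reference))
print(content_matches(spoken, [reference], max_distance=0.5))
```

The "at least one" wording in the claims maps onto `any(...)`: a match against a single enrolled reference sequence is enough for the content check to pass.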

16. A non-transitory computer-readable storage medium comprising instructions for causing one or more processor to:
- receive a plurality of different spoken utterances from a user;
  
  for each of the plurality of different spoken utterances:
  
  generate a respective phoneme-independent representation from the spoken utterance, the respective phoneme-independent representation including a respective spectral signature for each of a plurality of frames sampled from the spoken utterance; and
  
  decompose the respective phoneme-independent representation to obtain a respective content-independent recognition unit for the user;
  
  calculate a content-independent recognition distribution value for the user based on the respective content-independent recognition units generated from the plurality of different spoken utterances; and
  
  provide the content-independent recognition distribution value for use in a speaker authentication process.

17. A non-transitory computer-readable storage medium comprising instructions for causing one or more processor to:
- receive a spoken utterance of an unauthenticated speaker;
  
  generate a first phoneme-independent representation based on the spoken utterance;
  
  decompose the first phoneme-independent representation into at least one content-independent characteristic unit;
  
  compare the at least one content-independent characteristic unit to at least one content-independent recognition distribution value, the at least one content-independent recognition distribution value previously trained by a registered speaker and generated by decomposing a second phoneme-independent representation associated with the registered speaker to obtain at least one content-independent recognition unit; and
  
  authenticate the spoken utterance if the at least one content-independent characteristic unit is within a threshold limit of the at least one content-independent recognition distribution value.

18. A system for speech-based user authentication, comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving a plurality of different spoken utterances from a user;
  
  for each of the plurality of different spoken utterances:
  
  generating a respective phoneme-independent representation from the spoken utterance, the respective phoneme-independent representation including a respective spectral signature for each of a plurality of frames sampled from the spoken utterance; and
  
  decomposing the respective phoneme-independent representation to obtain a respective content-independent recognition unit for the user;
  
  calculating a content-independent recognition distribution value for the user based on the respective content-independent recognition units generated from the plurality of different spoken utterances; and
  
  providing the content-independent recognition distribution value for use in a speaker authentication process.
  - 19. The system of claim 18, wherein the one or more programs further include instructions for:
    - decomposing at least one spectral signature of an input speech signal into at least one content-independent characteristic unit;
      
      comparing the at least one content-independent characteristic unit to the at least one content-independent recognition distribution value; and
      
      authenticating the input speech signal if the at least one content-independent characteristic unit is within a threshold limit of the at least one content-independent recognition distribution value.
  - 20. The system of claim 18, wherein:
    - decomposing the respective phoneme-independent representation for each of the plurality of different spoken utterances further comprises decomposing the respective phoneme-independent representation to obtain a respective content reference sequence;
      
      decomposing the at least one spectral signature of the input speech signal further comprises decomposing the at least one spectral signature of the input speech signal into at least one content input sequence; and
      
      authenticating the input speech signal further comprises authenticating the input speech signal if the at least one content input sequence is similar to at least one of the respective content reference sequences.
  - 21. The system of claim 20, wherein the one or more programs include instructions for:
    - determining similarity based on a distance calculated between the at least one content input sequence and the at least one of the respective content reference sequences.

22. A system for authenticating a speech signal, comprising:
- one or more processors;
  
  memory;
  
  one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for:
  
  receiving a spoken utterance of an unauthenticated speaker;
  
  generating a first phoneme-independent representation based on the spoken utterance;
  
  decomposing the first phoneme-independent representation into at least one content-independent characteristic unit;
  
  comparing the at least one content-independent characteristic unit to at least one content-independent recognition distribution value, the at least one content-independent recognition distribution value previously trained by a registered speaker and generated by decomposing a second phoneme-independent representation associated with the registered speaker to obtain at least one content-independent recognition unit; and
  
  authenticating the spoken utterance if the at least one content-independent characteristic unit is within a threshold limit of the at least one content-independent recognition distribution value.
  - 23. The system of claim 22, wherein decomposing the first phoneme-independent representation further comprises:
    - applying a singular value decomposition to the first phoneme-independent representation.
  - 24. The system of claim 22, wherein decomposing the first phoneme-independent representation further comprises decomposing the first phoneme-independent representation into at least one content input sequence, and wherein authenticating the spoken utterance further comprises authenticating the spoken utterance if the at least one content input sequence is similar to at least one content reference sequence previously trained by the registered speaker.
  - 25. The system of claim 24, wherein the one or more programs further include instructions for:
    - determining similarity based on a distance calculated between the at least one content input sequence and the at least one content reference sequence.

Current Assignee: Apple Inc.
Original Assignee: Apple Inc.
Inventors: Bellegarda, Jerome R.; Silverman, Kim E. A.
Primary Examiner(s): Armstrong, Angela A
Application Number: US14/151,605
Publication Number: US 20140195237A1
Time in Patent Office: 712 Days
Field of Search: 704/250, 704/270, 704/273
US Class Current: 1/1

CPC Class Codes

G10L 15/07   to the speaker
G10L 17/04   Training, enrolment or mode...
G10L 17/08   Use of distortion metrics o...
G10L 17/14   Use of phonemic categorisat...
G10L 17/22   Interactive procedures; Man...
