Apparatus, method and system for cross-speaker speech recognition for telecommunication applications

US 6,438,520 B1
Filed: 01/20/1999
Issued: 08/20/2002
Est. Priority Date: 01/20/1999
Status: Expired due to Fees

Abstract

The apparatus, method and system of the present invention provide for cross-speaker speech recognition, and are particularly suited for telecommunication applications such as automatic name (voice) dialing, message management, call return management, and incoming call screening. The method of the present invention includes receiving incoming speech, such as an incoming caller name, and generating a phonetic transcription of the incoming speech with a speaker-independent, hidden Markov model having an unconstrained grammar in which any phoneme may follow any other phoneme, followed by determining a transcription parameter as a likelihood of fit of the incoming speech to the speaker-independent model. The method further selects a first phoneme pattern, from a plurality of phoneme patterns, as having a highest likelihood of fit to the incoming speech, utilizing a speaker-independent, hidden Markov model having a grammar constrained by these phoneme patterns, followed by determining a recognition parameter as a likelihood of fit of the incoming speech to the selected, first phoneme pattern. The method then determines whether the input speech matches or collides with the first phoneme pattern based upon a correspondence of the transcription parameter with the recognition parameter in accordance with a predetermined criterion. In the preferred embodiment, this matching or collision determination is made as a function of a confidence ratio, the ratio of the transcription parameter to the recognition parameter, being within or less than a predetermined threshold value.
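The matching-or-collision test described in the abstract can be illustrated with a short sketch. This is not the patented implementation: the function names, the use of log-likelihood scores as inputs, and the default threshold value are all assumptions made for illustration.

```python
import math


def confidence_ratio(transcription_loglik: float, recognition_loglik: float) -> float:
    """Ratio of the transcription likelihood to the recognition likelihood.

    Both inputs are log-likelihoods (an assumption of this sketch), so the
    ratio of the underlying likelihoods is exp(difference).
    """
    return math.exp(transcription_loglik - recognition_loglik)


def matches_pattern(transcription_loglik: float,
                    recognition_loglik: float,
                    threshold: float = 2.0) -> bool:
    """Return True (match or collision) when the ratio is below the threshold.

    The threshold value 2.0 is hypothetical; the source only says it is
    predetermined.
    """
    return confidence_ratio(transcription_loglik, recognition_loglik) < threshold
```

When the constrained model fits the speech nearly as well as the unconstrained one, the ratio stays close to 1 and the speech is treated as matching the selected pattern; a much poorer constrained fit pushes the ratio above the threshold.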

58 Claims

1. A method for cross-speaker speech recognition for telecommunication systems, the method comprising:
- (a) receiving incoming speech;
  
  (b) generating a phonetic representation of the incoming speech with a first speaker-independent model having an unconstrained grammar with a plurality of phonemes, in which any second phoneme of the plurality of phonemes may occur following any first phoneme of the plurality of phonemes;
  
  (c) determining a transcription parameter as a first correspondence of the incoming speech to the first speaker-independent model;
  
  (d) selecting a first phoneme pattern, from a plurality of phoneme patterns, utilizing a second speaker-independent model having a grammar constrained by the plurality of phoneme patterns;
  
  (e) determining a recognition parameter as a second correspondence of the incoming speech to the first phoneme pattern; and
  
  (f) determining whether the input speech matches the first phoneme pattern based upon a third correspondence of the transcription parameter with the recognition parameter in accordance with a predetermined criterion.
  - 2. The method of claim 1, wherein the first and second speaker-independent models are each a hidden Markov model.
  - 3. The method of claim 1, wherein the first and second speaker-independent models are subword (phoneme) based.
  - 4. The method of claim 1, wherein the phonetic representation is a phonetic transcription.
  - 5. The method of claim 1, wherein the plurality of phoneme patterns are generated from a plurality of speakers.
  - 6. The method of claim 1, wherein the incoming speech is from a first speaker of a plurality of speakers.
  - 7. The method of claim 1, wherein the first correspondence is a likelihood of fit.
  - 8. The method of claim 1, wherein the second correspondence is a likelihood of fit.
  - 9. The method of claim 1, wherein the third correspondence is a confidence ratio.
  - 10. The method of claim 1, wherein step (f) further comprises:
11. The method of claim 1, wherein step (f) further comprises:
- comparing the transcription parameter to the recognition parameter to form a confidence ratio;
  
  when the confidence ratio is less than a predetermined threshold, determining that the input speech matches the first phoneme pattern; and
  
  when the confidence ratio is not less than the predetermined threshold, determining that the input speech does not match the first phoneme pattern.
12. The method of claim 1, further comprising generating a name list, wherein generating the name list includes:
- receiving as incoming speech a first sample of a name and performing steps (b) through (f), inclusive, on the first sample; and
  
  when the first sample does not match the first phoneme pattern, including the phonetic representation of the first sample within the plurality of phoneme patterns.
13. The method of claim 1, further comprising generating a name list, wherein generating the name list includes:
- receiving as incoming speech a first sample of a name and performing steps (b) through (f), inclusive, on the first sample;
  
  when the first sample does not match the first phoneme pattern, initially including a phonetic representation of the first sample within the plurality of phoneme patterns, receiving as incoming speech a second sample of the name, and performing steps (b) through (f), inclusive, on the second sample; and
  
  determining whether the second sample matches the first sample and, when the second sample does match the first sample, including the name in the name list and including corresponding phonetic representations of both the first sample and the second sample in the plurality of phoneme patterns.
14. The method of claim 1, further comprising generating a message list, wherein generating the message list includes:
- receiving as incoming speech a caller name and performing steps (b) through (f), inclusive, on the caller name;
  
  when the caller name does not match the first phoneme pattern, including the caller name in the message list and indicating that one call has been received from the caller name;
  
  when the caller name does match the first phoneme pattern, incrementing a count of calls received from the caller name.
15. The method of claim 14, further comprising performing message playback, wherein performing message playback includes:
- receiving incoming speech;
  
  selecting the first phoneme pattern, from a subset of the plurality of phoneme patterns corresponding to the message list, as the highest likelihood of fit to the incoming speech; and
  
  playing a first message associated with the first phoneme pattern.
16. The method of claim 15, further comprising:
- when a plurality of messages are associated with the first phoneme pattern, sequentially playing the plurality of messages.
17. The method of claim 1, further comprising performing call return, wherein performing call return includes:
- receiving incoming speech;
  
  selecting the first phoneme pattern, from a subset of the plurality of phoneme patterns corresponding to a name list and a message list, as the highest likelihood of fit to the incoming speech; and
  
  transmitting a telecommunication number associated with the first phoneme pattern.
18. The method of claim 1, further comprising performing incoming call screening, wherein the plurality of phoneme patterns are predetermined to correspond to a plurality of names on a call screening list of a subscriber, and performing incoming call screening includes:
- receiving an incoming call leg;
  
  receiving as incoming speech a caller name and performing steps (b) through (f), inclusive, on the caller name;
  
  when the caller name does not match the first phoneme pattern, transferring the incoming call leg to a message system;
  
  when the caller name does match the first phoneme pattern, transferring the incoming call leg to the subscriber.
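Steps (b) through (f) of claim 1 can be sketched on toy data. The sketch below is a simplified stand-in, not the hidden Markov models of the claims: it assumes per-frame phoneme log-likelihoods are already available, ignores transition costs, and uses hypothetical names (`transcribe`, `align_score`, `recognize`, `log_threshold`) and invented scores.

```python
from itertools import groupby

NEG_INF = float("-inf")


def transcribe(frames):
    """Unconstrained pass: any phoneme may follow any other.

    With no transition costs (a simplification), the best path simply takes
    the top-scoring phoneme in each frame.
    """
    best = [max(f, key=f.get) for f in frames]
    score = sum(f[p] for f, p in zip(frames, best))
    phonemes = [p for p, _ in groupby(best)]  # merge runs of the same phoneme
    return phonemes, score


def align_score(frames, pattern):
    """Constrained pass: best monotonic alignment of frames to one pattern."""
    if not pattern or len(frames) < len(pattern):
        return NEG_INF
    # dp[j] = best score with the current frame assigned to pattern[j]
    dp = [NEG_INF] * len(pattern)
    dp[0] = frames[0].get(pattern[0], NEG_INF)
    for f in frames[1:]:
        new = [NEG_INF] * len(pattern)
        for j, ph in enumerate(pattern):
            prev = max(dp[j], dp[j - 1] if j > 0 else NEG_INF)  # stay or advance
            if prev > NEG_INF:
                new[j] = prev + f.get(ph, NEG_INF)
        dp = new
    return dp[-1]  # the whole pattern must be consumed


def recognize(frames, patterns, log_threshold=1.0):
    """Return (best pattern label, transcription, match decision)."""
    phonemes, t_score = transcribe(frames)            # steps (b), (c)
    label, r_score = max(                             # steps (d), (e)
        ((name, align_score(frames, p)) for name, p in patterns.items()),
        key=lambda item: item[1],
    )
    log_ratio = t_score - r_score                     # log of the confidence ratio
    return label, phonemes, log_ratio < log_threshold  # step (f)


# Invented per-frame log-likelihoods for three phonemes.
frames = [
    {"b": -1.0, "a": -3.0, "k": -4.0},
    {"b": -3.0, "a": -0.5, "k": -4.0},
    {"b": -1.2, "a": -3.0, "k": -3.5},
]
patterns = {"bob": ["b", "a", "b"], "kay": ["k", "a"]}
print(recognize(frames, patterns))
```

Because the unconstrained pass can never score worse than any constrained alignment in this toy setup, the log ratio is non-negative, and a small value indicates that the selected pattern explains the speech almost as well as the free transcription.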

19. An apparatus for cross-speaker speech recognition for telecommunication systems, the apparatus comprising:
- a network interface to receive incoming speech;
  
  a memory, the memory storing a plurality of phoneme patterns; and
  
  a processor coupled to the network interface and to the memory, wherein the processor, when operative, includes instructions to generate a phonetic representation of the incoming speech with a first speaker-independent model having an unconstrained grammar having a plurality of phonemes, in which any second phoneme of the plurality of phonemes may occur following any first phoneme of the plurality of phonemes and determine a transcription parameter as a first correspondence of the incoming speech to the first speaker-independent model;
  
  the processor including further instructions to select a first phoneme pattern, from the plurality of phoneme patterns, utilizing a second speaker-independent model having a grammar constrained by the plurality of phoneme patterns, and to determine a recognition parameter as a second correspondence of the incoming speech to the first phoneme pattern; and
  
  the processor including further instructions to determine whether the input speech matches the first phoneme pattern based upon a third correspondence of the transcription parameter with the recognition parameter in accordance with a predetermined criterion.
  - 20. The apparatus of claim 19, wherein the first and second speaker-independent models are each a hidden Markov model.
  - 21. The apparatus of claim 19, wherein the first and second speaker-independent models are subword (phoneme) based.
  - 22. The apparatus of claim 19, wherein the phonetic representation is a phonetic transcription.
  - 23. The apparatus of claim 19, wherein the plurality of phoneme patterns are generated from a plurality of speakers.
  - 24. The apparatus of claim 19, wherein the incoming speech is from a first speaker of a plurality of speakers.
  - 25. The apparatus of claim 19, wherein the first correspondence is a likelihood of fit.
  - 26. The apparatus of claim 19, wherein the second correspondence is a likelihood of fit.
  - 27. The apparatus of claim 19, wherein the third correspondence is a confidence ratio.
  - 28. The apparatus of claim 19, wherein the processor includes further instructions to determine that the input speech matches the first phoneme pattern when the transcription parameter compares with the recognition parameter in accordance with the predetermined criterion;
    - and to determine that the input speech does not match the first phoneme pattern when the transcription parameter does not compare with the recognition parameter in accordance with the predetermined criterion.
  - 29. The apparatus of claim 19, wherein the processor includes further instructions to compare the transcription parameter to the recognition parameter to form a confidence ratio;
    - when the confidence ratio is less than a predetermined threshold, to determine that the input speech matches the first phoneme pattern; and
      
      when the confidence ratio is not less than the predetermined threshold, to determine that the input speech does not match the first phoneme pattern.
  - 30. The apparatus of claim 19, wherein the processor includes further instructions to generate a name list stored in the memory, and wherein generating the name list includes determining whether a first sample of a name, received as incoming speech by the network interface, matches the first phoneme pattern;
    - when the first sample does not match the first phoneme pattern, the processor including further instructions to include a phonetic representation of the first sample within the plurality of phoneme patterns.
  - 31. The apparatus of claim 19, wherein the processor includes further instructions to generate a name list stored in the memory, and wherein generating the name list includes determining whether a first sample of a name, received as incoming speech by the network interface, matches the first phoneme pattern;
    - when the first sample does not match the first phoneme pattern, the processor including further instructions to initially include a phonetic representation of the first sample within the plurality of phoneme patterns, and to determine whether a second sample of the name, received as incoming speech by the network interface, matches the first sample; and
      
      when the second sample does match the first sample, the processor including further instructions to include the name in the name list and include corresponding phonetic representations of both the first sample and the second sample in the plurality of phoneme patterns stored in the memory.
  - 32. The apparatus of claim 19, wherein the processor includes further instructions to generate a message list stored in the memory, and wherein generating the message list includes determining whether a caller name, received as incoming speech by the network interface, matches the first phoneme pattern;
    - when the caller name does not match the first phoneme pattern, the processor including further instructions to include the caller name in the message list stored in the memory and to indicate that one call has been received from the caller name; and
      
      when the caller name does match the first phoneme pattern, the processor including further instructions to increment a count of calls received from the caller name and to store the incremented count in the memory.
  - 33. The apparatus of claim 32, wherein the processor includes further instructions to perform message playback, wherein performing message playback includes selecting the first phoneme pattern, from a subset of the plurality of phoneme patterns corresponding to the message list, as the highest likelihood of fit to the incoming speech, and playing a first message associated with the first phoneme pattern.
  - 34. The apparatus of claim 33, wherein the processor includes further instructions, when a plurality of messages are associated with the first phoneme pattern, to sequentially play the plurality of messages.
  - 35. The apparatus of claim 19, wherein the processor includes further instructions to perform call return, wherein performing call return includes selecting the first phoneme pattern, from a subset of the plurality of phoneme patterns corresponding to a name list and a message list, as the highest likelihood of fit to the incoming speech;
    - and wherein the processor includes further instructions to direct the network interface to transmit a telecommunication number associated with the first phoneme pattern.
  - 36. The apparatus of claim 19, wherein the plurality of phoneme patterns are predetermined to correspond to a plurality of names on a call screening list of a subscriber stored in the memory, wherein the processor includes further instructions to perform incoming call screening, and wherein performing incoming call screening includes determining whether a caller name, received as incoming speech by the network interface in conjunction with an incoming call leg, matches a first phoneme pattern;
    - when the caller name does not match the first phoneme pattern, the processor including further instructions to transfer the incoming call leg to a message system; and
      
      when the caller name does match the first phoneme pattern, the processor including further instructions to direct the network interface to transfer the incoming call leg to the subscriber.
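The name-list and message-list bookkeeping of claims 30 and 32 can be sketched as a small in-memory structure. Everything here is illustrative: the class names `Memory` and `Recognizer`, the `exact_matcher` placeholder (which stands in for the model-based match test of claim 19), and the choice to reject a colliding name are assumptions, not details taken from the claims.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence

# A matcher returns the label of the matching stored pattern, or None.
Matcher = Callable[[Sequence[str], Dict[str, List[str]]], Optional[str]]


def exact_matcher(phonemes: Sequence[str],
                  patterns: Dict[str, List[str]]) -> Optional[str]:
    """Placeholder match test: exact phoneme-sequence equality."""
    for label, pattern in patterns.items():
        if list(phonemes) == pattern:
            return label
    return None


@dataclass
class Memory:
    patterns: Dict[str, List[str]] = field(default_factory=dict)
    call_counts: Dict[str, int] = field(default_factory=dict)


@dataclass
class Recognizer:
    memory: Memory = field(default_factory=Memory)
    matcher: Matcher = exact_matcher

    def enroll_name(self, label: str, phonemes: Sequence[str]) -> bool:
        """Store a new name only when it matches no stored pattern."""
        if self.matcher(phonemes, self.memory.patterns) is not None:
            return False  # collision with an existing entry (assumed: reject)
        self.memory.patterns[label] = list(phonemes)
        return True

    def log_caller(self, label: str, phonemes: Sequence[str]) -> int:
        """New caller: record one call. Known caller: increment the count."""
        known = self.matcher(phonemes, self.memory.patterns)
        if known is None:
            self.memory.patterns[label] = list(phonemes)
            self.memory.call_counts[label] = 1
            return 1
        self.memory.call_counts[known] = self.memory.call_counts.get(known, 0) + 1
        return self.memory.call_counts[known]
```

Passing the match test in as a callable keeps the bookkeeping separate from the recognition step, so a model-based matcher could replace `exact_matcher` without changing the list handling.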

37. A system for cross-speaker speech recognition for telecommunication systems, the system comprising:
- a switch to receive an incoming call leg; and
  
  an adjunct network entity coupled to the switch, wherein the adjunct network entity, when operative, includes instructions to receive incoming speech, generate a phonetic representation of the incoming speech with a first speaker-independent model having an unconstrained grammar having a plurality of phonemes, in which any second phoneme of the plurality of phonemes may occur following any first phoneme of the plurality of phonemes, and determine a transcription parameter as a first correspondence of the incoming speech to the first speaker-independent model;
  
  the adjunct network entity including further instructions to select a first phoneme pattern, from a plurality of phoneme patterns, utilizing a second speaker-independent model having a grammar constrained by the plurality of phoneme patterns, and to determine a recognition parameter as a second correspondence of the incoming speech to the first phoneme pattern; and
  
  the adjunct network entity including further instructions to determine whether the input speech matches the first phoneme pattern based upon a third correspondence of the transcription parameter with the recognition parameter in accordance with a predetermined criterion.
  - 38. The system of claim 37, wherein the first and second speaker-independent models are each a hidden Markov model.
  - 39. The system of claim 37, wherein the first and second speaker-independent models are subword (phoneme) based.
  - 40. The system of claim 37, wherein the phonetic representation is a phonetic transcription.
  - 41. The system of claim 37, wherein the plurality of phoneme patterns are generated from a plurality of speakers.
  - 42. The system of claim 37, wherein the incoming speech is from a first speaker of a plurality of speakers.
  - 43. The system of claim 37, wherein the first correspondence is a likelihood of fit.
  - 44. The system of claim 37, wherein the second correspondence is a likelihood of fit.
  - 45. The system of claim 37, wherein the third correspondence is a confidence ratio.
  - 46. The system of claim 37, wherein the adjunct network entity is a service node.
  - 47. The system of claim 37, wherein the adjunct network entity is a service control point.
  - 48. The system of claim 37, wherein the adjunct network entity is an intelligent peripheral.
  - 49. The system of claim 37, wherein the adjunct network entity is a compact service node and intelligent peripheral.
  - 50. The system of claim 37, wherein the adjunct network entity includes further instructions to determine that the input speech matches the first phoneme pattern when the transcription parameter compares with the recognition parameter in accordance with the predetermined criterion;
    - and to determine that the input speech does not match the first phoneme pattern when the transcription parameter does not compare with the recognition parameter in accordance with the predetermined criterion.
  - 51. The system of claim 37, wherein the adjunct network entity includes further instructions to compare the transcription parameter to the recognition parameter to form a confidence ratio;
    - when the confidence ratio is less than a predetermined threshold, to determine that the input speech matches the first phoneme pattern; and
      
      when the confidence ratio is not less than the predetermined threshold, to determine that the input speech does not match the first phoneme pattern.
  - 52. The system of claim 37, wherein the adjunct network entity includes further instructions to generate a name list, and wherein generating the name list includes receiving as incoming speech a first sample of a name and determining whether the first sample matches the first phoneme pattern;
    - when the first sample does not match the first phoneme pattern, the adjunct network entity including further instructions to include a phonetic representation of the first sample within the plurality of phoneme patterns.
  - 53. The system of claim 37, wherein the adjunct network entity includes further instructions to generate a name list, and wherein generating the name list includes receiving as incoming speech a first sample of a name and determining whether the first sample matches the first phoneme pattern;
    - when the first sample does not match the first phoneme pattern, the adjunct network entity including further instructions to initially include a phonetic representation of the first sample within the plurality of phoneme patterns, to receive as incoming speech a second sample of the name, and to determine whether the second sample matches the first sample; and
      
      when the second sample does match the first sample, the adjunct network entity including further instructions to include the name in the name list and include corresponding phonetic representations of both the first sample and the second sample in the plurality of phoneme patterns stored in the memory.
  - 54. The system of claim 37, wherein the adjunct network entity includes further instructions to generate a message list, and wherein generating the message list includes receiving as incoming speech a caller name, and determining whether the caller name matches the first phoneme pattern;
    - when the caller name does not match the first phoneme pattern, the adjunct network entity including further instructions to include the caller name in the message list and to indicate that one call has been received from the caller name; and
      
      when the caller name does match the first phoneme pattern, the adjunct network entity including further instructions to increment a count of calls received from the caller name.
  - 55. The system of claim 37, wherein the adjunct network entity includes further instructions to perform message playback, wherein performing message playback includes receiving incoming speech;
    - selecting the first phoneme pattern, from a subset of the plurality of phoneme patterns corresponding to the message list, as the highest likelihood of fit to the incoming speech; and
      
      playing a first message associated with the first phoneme pattern.
  - 56. The system of claim 55, wherein the adjunct network entity includes further instructions, when a plurality of messages are associated with the first phoneme pattern, to sequentially play the plurality of messages.
  - 57. The system of claim 56, wherein the adjunct network entity includes further instructions to perform call return, wherein performing call return includes receiving incoming speech and selecting the first phoneme pattern, from a subset of the plurality of phoneme patterns corresponding to a name list and a message list, as the highest likelihood of fit to the incoming speech;
    - and wherein the adjunct network entity includes further instructions to transmit a telecommunication number associated with the first phoneme pattern.
  - 58. The system of claim 37, wherein the plurality of phoneme patterns are predetermined to correspond to a plurality of names on a call screening list of a subscriber, wherein the adjunct network entity includes further instructions to perform incoming call screening, and wherein performing incoming call screening includes receiving as incoming speech a caller name and determining whether the caller name matches a first phoneme pattern;
    - when the caller name does not match the first phoneme pattern, the adjunct network entity including further instructions to transfer the incoming call leg to a message system; and
      
      when the caller name does match the first phoneme pattern, the adjunct network entity including further instructions to transfer the incoming call leg to the subscriber.
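The incoming call screening of claim 58 reduces to a routing decision once the match test is available. The following sketch assumes a hypothetical `is_match` callable standing in for the recognition steps of claim 37, and the names `Route` and `screen_call` are invented for illustration.

```python
from enum import Enum
from typing import Callable, Dict, List, Sequence


class Route(Enum):
    SUBSCRIBER = "subscriber"
    MESSAGE_SYSTEM = "message_system"


def screen_call(caller_phonemes: Sequence[str],
                screening_list: Dict[str, List[str]],
                is_match: Callable[[Sequence[str], Sequence[str]], bool]) -> Route:
    """Route the call leg to the subscriber only when the caller name matches
    a pattern on the subscriber's screening list; otherwise send it to the
    message system."""
    for pattern in screening_list.values():
        if is_match(caller_phonemes, pattern):
            return Route.SUBSCRIBER
    return Route.MESSAGE_SYSTEM


def same_sequence(a: Sequence[str], b: Sequence[str]) -> bool:
    """Toy match test (assumption): exact phoneme-sequence equality."""
    return list(a) == list(b)
```

In a deployment along the lines of the claims, the switch would hold the incoming call leg while the adjunct network entity evaluates the caller name and returns the routing decision.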

Current Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Original Assignee: Lucent Technologies, Inc. (Nokia Corporation)
Inventors: Wisowaty, John Joseph; Curt, Carol Lynn; Sukkar, Rafid Antoon
Primary Examiner(s): Banks-Harold, Marsha D.
Assistant Examiner(s): ABEBE, DANIEL DEMELASH
Application Number: US09/234,204
Time in Patent Office: 1,308 Days
Field of Search: 704/254, 704/256, 704/246, 704/251, 704/275, 704/252, 379/88.03, 379/88.04
US Class Current: 704/254
CPC Class Codes: G10L 15/187 Phonemic context, e.g. pron...
