Distributed real time speech recognition system

US 20050080625A1
Filed: 10/10/2003
Published: 04/14/2005
Est. Priority Date: 11/12/1999
Status: Active Grant

Abstract

A real-time system incorporating speech recognition and linguistic processing for recognizing a spoken query by a user and distributed between client and server, is disclosed. The system accepts user's queries in the form of speech at the client where minimal processing extracts a sufficient number of acoustic speech vectors representing the utterance. These vectors are sent via a communications channel to the server where additional acoustic vectors are derived. Using Hidden Markov Models (HMMs), and appropriate grammars and dictionaries conditioned by the selections made by the user, the speech representing the user's query is fully decoded into text (or some other suitable form) at the server. This text corresponding to the user's query is then simultaneously sent to a natural language engine and a database processor where optimized SQL statements are constructed for a full-text search from a database for a recordset of several stored questions that best matches the user's query. Further processing in the natural language engine narrows the search to a single stored question. The answer corresponding to this single stored question is next retrieved from the file path and sent to the client in compressed form. At the client, the answer to the user's query is articulated to the user using a text-to-speech engine in his or her native natural language. The system requires no training and can operate in several natural languages.

70 Claims

1. (Canceled)

2. (Canceled)

3. (Canceled)

4. (Canceled)

5. (Canceled)

6. (Canceled)

7. (Canceled)

8. (Canceled)

9. (Canceled)

10. (Canceled)

11. A machine executable program for use in a voice query recognition system that is distributed across a client system and a separate server system, the program comprising:
- a first audio signal receiving routine for receiving user speech utterance signals representing speech utterances to be recognized during a sequence of speech utterance evaluation time frames, said speech utterances including sentences comprised of one or more words; and
  
  a first signal processing routine adapted to generate representative speech data values for each speech utterance evaluation time frame during which speech utterance signals are received, said representative speech data values including a set of compressed mel-frequency cepstral coefficients (MFCC);
  
  a formatting routine for rendering said representative speech data values into a transmission format suitable for transmission from the client system over a communications channel to a second processing routine executing on the server computing system; and
  
  wherein said representative speech data values are transmitted continuously during said speech utterances within streaming packets and without waiting for silence to be detected and/or said speech utterances to be completed;
  
  further wherein said representative speech data values constitute a minimum amount of information that can be used by said second processing routine to complete accurate recognition of said one or more words and said sentences.
- View Dependent Claims (12, 13, 14, 15, 46)
- - 12. The program of claim 11, wherein said program works within a browser program executing on said computing system as part of a client-server based system.
  - 13. The program of claim 11, wherein said set of compressed MFCCs is generated at a rate corresponding to at least 100 frames per second, and such that said set of compressed MFCCs includes a separate cepstral coefficient value for a corresponding frequency component of said user speech utterance signals, and said first data content corresponds to a set of said frequency components spanning an audible speech frequency range.
  - 14. The program of claim 11, wherein additional data content including a set of delta and acceleration coefficients is computed from said set of compressed MFCCs at either said client system or said server computing system on a connection by connection basis based on an evaluation of computing resources available at such client system and said server computing system.
  - 15. The program of claim 11, wherein said second processing routine is configured with an amount of resources by said server computing system based on a bandwidth and transmission speed associated with a transmission link between said server computing system and said client system so that said second processing routine performs accurate recognition of said one or more words with a first latency that is less than a second latency that would result if said one or more words were recognized by said first signal processing routine and then transmitted over said transmission link.
  - 46. The system of claim 11, further wherein said second processing means performs a query for determining which of said one or more words correspond to said one or more text words.
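
Claim 14 refers to delta and acceleration coefficients computed from the compressed MFCCs at either the client or the server. The claims do not give a formula, so the following is only a minimal sketch that assumes the widely used regression form of the delta, with edge frames repeated; the function names `deltas` and `with_dynamic_features` and the window size of 2 are illustrative assumptions, not taken from the patent.

```python
from typing import List

Frame = List[float]  # one vector of cepstral coefficients per time frame


def deltas(frames: List[Frame], window: int = 2) -> List[Frame]:
    """Regression-based delta of each coefficient over +/- `window` frames.

    Assumed formula (not stated in the claims):
        d[t] = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2),  n = 1..window
    Frames before the start or after the end are replaced by the first or
    last frame respectively.
    """
    if not frames:
        return []
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out: List[Frame] = []
    for t in range(len(frames)):
        row: Frame = []
        for k in range(len(frames[t])):
            num = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, last)][k]
                behind = frames[max(t - n, 0)][k]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def with_dynamic_features(frames: List[Frame]) -> List[Frame]:
    """Append delta and acceleration (delta of delta) values to each frame."""
    d = deltas(frames)
    a = deltas(d)
    return [c + dv + av for c, dv, av in zip(frames, d, a)]
```

Because the derived values depend only on the static coefficients, the same function could run on whichever side the connection-by-connection evaluation selects.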

16. (Canceled)

17. (Canceled)

18. (Canceled)

19. (Canceled)

20. (Canceled)

21. (Canceled)

22. (Canceled)

23. (Canceled)

24. (Canceled)

25. (Canceled)

26. (Canceled)

27. A system for assisting a client computing device to perform speech recognition in cooperation with a server computing device, the system comprising:
- a speech utterance capture circuit for receiving a speech utterance and generating associated speech utterance signals, where said speech utterance can include an articulated sentence of one or more articulated words; and
  
  a speech utterance signal processing circuit, said signal processing being configurable to perform data extracting operations on said speech utterance signals to generate a set of frequency related speech utterance signals for said articulated sentence; and
  
  wherein said set of frequency related speech utterance signals include a set of compressed mel-frequency cepstral coefficients (MFCC);
  
  a transmission circuit for coding said set of frequency related speech utterance signals into a format suitable for transmission over a communications channel to the server;
  
  a receiving circuit for receiving a response to said articulated sentence through said communications channel from the server, said response being generated by said server using said set of frequency related speech utterance signals to perform a word recognition operation on said one or more articulated words and a sentence recognition operation on said articulated sentence; and
  
  wherein a latency associated with performing said speech recognition is minimized by optimizing an allocation of signal processing responsibilities for said speech utterance signals between the client computing device and the server computing device on a case-by-case basis in accordance with signal processing capabilities of the client computing device.

28. A system for assisting a client computing device to perform real-time speech recognition in cooperation with a server computing device, the system comprising:
- a sound processing circuit integrated within the client computing device, said sound processing circuit being adapted to receive a continuous speech utterance and to generate associated speech utterance signals therefrom, wherein said speech utterance can include an articulated sentence of one or more articulated words; and
  
  a first signal processing routine adapted to be executed by the client computing device, and which first signal processing routine is further adapted to continuously generate a set of speech-based vector coefficients as needed from said speech utterance signals; and
  
  a transmission circuit coupled to the client computing device for coding said set of speech based vector coefficients into a format suitable for transmission over a communications channel to the server, said set of speech-based vector coefficients being continuously transmitted in real-time within a Hypertext Transport Protocol (HTTP) byte stream as said speech utterances occur;
  
  a receiving circuit coupled to the client computing device for receiving a real-time response to said articulated sentence through said communications channel from the server;
  
  wherein said response is generated by said server substantially on a real-time basis using said set of speech based vector coefficients to perform a second signal processing routine which completes a word recognition operation on said one or more articulated words, as well as a sentence recognition operation on said articulated sentence;
  
  further wherein at least some words are recognized in real-time before said speech utterance is completed.
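
Claim 28 describes coefficients being transmitted continuously in a byte stream while the user is still speaking. One way to picture this is a generator that emits a small binary packet as soon as a few frames are available, rather than waiting for the end of the utterance. The packet layout below (a two-field header followed by 32-bit floats) and the names `packetize` and `decode` are assumptions made for illustration; the claim does not specify a wire format.

```python
import struct
from typing import Iterable, Iterator, List

Frame = List[float]


def _encode(batch: List[Frame]) -> bytes:
    """Hypothetical layout: uint16 frame count, uint16 coefficients per
    frame, then all values as little-endian float32."""
    n_coeffs = len(batch[0])
    header = struct.pack("<HH", len(batch), n_coeffs)
    flat = [v for frame in batch for v in frame]
    return header + struct.pack(f"<{len(flat)}f", *flat)


def packetize(frames: Iterable[Frame], frames_per_packet: int = 10) -> Iterator[bytes]:
    """Yield a packet as soon as `frames_per_packet` frames have arrived."""
    batch: List[Frame] = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == frames_per_packet:
            yield _encode(batch)
            batch = []
    if batch:  # flush whatever is left when the utterance ends
        yield _encode(batch)


def decode(packet: bytes) -> List[Frame]:
    """Server-side inverse of `_encode`."""
    n_frames, n_coeffs = struct.unpack_from("<HH", packet, 0)
    flat = struct.unpack_from(f"<{n_frames * n_coeffs}f", packet, 4)
    return [list(flat[i * n_coeffs:(i + 1) * n_coeffs]) for i in range(n_frames)]
```

Since `packetize` is lazy, each packet could be handed to an HTTP client that supports streamed request bodies as soon as it is produced; that transport step is left out of the sketch.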

29. A distributed speech recognition system for processing a speech utterance comprising:
- a first signal processing circuit associated with a client computing system, said first signal processing circuit being adapted to generate a first set of speech data values from speech utterance signals, wherein said first set of speech data values have a limited data content and are compressed without quantization to reduce processing and transmission latencies in the distributed speech recognition system;
  
  a second signal processing circuit associated with a separate server computing system, said second signal processing circuit being configured to generate a second set of speech data values derived from said first set of speech data values, and being further configured to generate a combined speech data value set consisting of said second set of speech data values and said first set of data values;
  
  a word recognition circuit adapted to use said combined speech data value set for recognizing words in the speech utterance, said word recognition circuit being configured to recognize words before said speech utterance is finished.
- View Dependent Claims (30, 31, 32, 33, 34)
- - 30. The system of claim 29, further including a sentence recognition circuit which recognizes an articulated sentence containing said recognized words.
  - 31. The system of claim 30, wherein said articulated sentence can include one of a number of predefined sentences recognizable by said system, and said articulated sentence is recognized by identifying a candidate set of potential sentences from said number of predefined sentences corresponding to said articulated sentence, and then comparing each entry in the candidate set of potential sentences to said articulated sentence to determine a matching recognized sentence.
  - 32. The system of claim 31, wherein said articulated sentence is processed by a natural language engine operating on said recognized words.
  - 33. The system of claim 32, wherein said articulated sentence is compared against said candidate set of potential sentences by examining noun phrases including noun phrases consisting of multiple words.
  - 34. The system of claim 31, wherein said candidate set of potential sentences are determined in part by a context dictionary loaded by said sentence recognition circuit in response to an operating environment presented by said system to a user.
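
Claims 31 to 33 describe a two-stage match: first narrow the predefined sentences to a candidate set, then compare each candidate to the articulated sentence using noun phrases, including multi-word ones. The sketch below illustrates that shape only. It substitutes simple word overlap for the candidate search and adjacent word pairs for real noun-phrase extraction, both of which are stand-in assumptions; all function names are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def words(sentence: str) -> List[str]:
    """Lowercase whitespace tokenization with question marks removed."""
    return sentence.lower().replace("?", "").split()


def phrases(sentence: str) -> Set[Tuple[str, ...]]:
    """Single words plus adjacent word pairs, a crude stand-in for
    single-word and multi-word noun phrases."""
    w = words(sentence)
    singles = {(x,) for x in w}
    pairs = {tuple(w[i:i + 2]) for i in range(len(w) - 1)}
    return singles | pairs


def candidates(query: str, stored: List[str], min_shared: int = 1) -> List[str]:
    """Stage 1: keep stored sentences sharing at least `min_shared` words."""
    q = set(words(query))
    return [s for s in stored if len(q & set(words(s))) >= min_shared]


def best_match(query: str, stored: List[str]) -> Optional[str]:
    """Stage 2: pick the candidate with the highest phrase overlap
    (Jaccard similarity over the phrase sets)."""
    cands = candidates(query, stored)
    if not cands:
        return None
    q = phrases(query)

    def score(s: str) -> float:
        p = phrases(s)
        return len(q & p) / len(q | p)

    return max(cands, key=score)
```

For example, with stored questions "what is the return policy" and "what is the shipping cost", the query "what is your return policy?" passes both through stage 1 and selects the first in stage 2 because it shares the pair "return policy".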

35. (Canceled)

36. (Canceled)

37. (Canceled)

38. (Canceled)

39. (Canceled)

40. (Canceled)

41. (Canceled)

42. (Canceled)

43. (Canceled)

44. (Canceled)

45. (Canceled)

47. (Canceled)

48. (Canceled)

49. (Canceled)

50. (Canceled)

51. (Canceled)

52. (Canceled)

53. A method of performing distributed voice recognition comprising the steps of:
- (a) receiving user speech utterance signals representing speech utterances to be recognized during a sequence of speech utterance evaluation time frames, said speech utterances including sentences comprised of one or more words; and
  
  (b) generating representative speech data values with a first processing circuit for each speech utterance evaluation time frame during which speech utterance signals are received, said representative speech data values including a set of compressed mel-frequency cepstral coefficients (MFCC);
  
  (c) encoding said representative speech data values into a transmission format suitable for transmission over a communications channel to a second processing circuit; and
  
  further wherein said representative speech data values constitute a minimum amount of information that can be used by said second processing circuit to complete accurate recognition of said one or more words and said sentences.
- View Dependent Claims (54, 55, 56, 57)
- - 54. The method of claim 53, wherein said recognition of said one or more words occurs in real-time.
  - 55. The method of claim 53, wherein said set of compressed MFCCs is generated at a rate corresponding to at least 100 frames per second, and such that said set of compressed MFCCs includes a separate cepstral coefficient value for a corresponding frequency component of said user speech utterance signals, and said first data content corresponds to a set of said frequency components spanning an audible speech frequency range.
  - 56. The method of claim 55, wherein a set of delta and acceleration coefficients are computed from said cepstral coefficient values to complete recognition of said one or more words and said sentences, wherein such set of delta and acceleration coefficients are computed at either said first processing circuit or said second processing circuit on a connection by connection basis based on an evaluation of computing resources available at such respective processing circuits.
  - 57. The method of claim 53, wherein said second processing circuit is configured with an amount of resources by a server computing system based on a bandwidth and transmission speed associated with a transmission link between said server computing system and a client system associated with the first processing circuit, so that said second processing circuit performs accurate recognition of said one or more words with a first latency that is less than a second latency that would result if said one or more words were recognized by said first processing circuit and then transmitted over said transmission link.
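
Claim 55 gives a rate of at least 100 frames per second, and claim 56 allows the delta and acceleration coefficients to be computed on either side of the link. A rough calculation shows why that choice affects the amount of data sent. The figures of 13 coefficients per frame and 4 bytes per coefficient are assumptions for illustration and do not come from the claims.

```python
def stream_rate_bytes_per_second(frames_per_second: int,
                                 coeffs_per_frame: int,
                                 bytes_per_coeff: int) -> int:
    """Raw payload rate, ignoring packet headers and any compression."""
    return frames_per_second * coeffs_per_frame * bytes_per_coeff


def frame_period_ms(frames_per_second: int) -> float:
    """Time between consecutive frames in milliseconds."""
    return 1000.0 / frames_per_second


# Assumed sizes: 13 static coefficients, 4 bytes each, at the claimed 100 frames/s.
static_only = stream_rate_bytes_per_second(100, 13, 4)
# If deltas and accelerations were also sent, each frame would be three times larger.
with_dynamics = stream_rate_bytes_per_second(100, 39, 4)
```

Under these assumed sizes, sending only the static coefficients takes one third of the payload needed when the derived coefficients are transmitted as well.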

58. A method of performing distributed speech recognition using a first computing device and a second computing device, the method comprising the steps of:
- (a) evaluating speech processing capabilities of the first computing device using an initialization routine, and (b) evaluating a transmission latency of a communications channel coupling the first computing device and the second computing device, and (c) allocating speech processing tasks between the first computing device and the second computing device based on results of steps (a) and (b), such that an overall speech recognition process is customized on a case-by-case basis for performance characteristics of the first computing device and the second computing device; and
  
  (d) receiving a speech utterance at the first computing device; and
  
  (e) generating associated speech utterance signals from said speech utterance with the first computing device; and
  
  (f) generating a first set of speech data values from said speech utterance signals at the first computing device, said first set of speech data values being insufficient by themselves for permitting recognition of words articulated in said speech utterance; and
  
  (g) formatting said first set of speech data values at the first computing device to be compatible with a communications protocol used by said communications channel;
  
  (h) transmitting said first set of speech data values through said channel to the second computing device; and
  
  (i) generating a second set of speech data values based on said speech data values, such that second set of speech data values contain sufficient information to be usable by a word recognition engine for recognizing words in said speech utterance.
- View Dependent Claims (59, 60, 61, 62, 63, 64)
- - 59. The method of claim 58, wherein said second set of speech data values include said first set of speech data values and a derived set of speech data values, which derived set of speech data values are computed based on said first speech data values.
  - 60. The method of claim 58, wherein said second set of speech data values can be generated by said second computing device in a time that is less than the combination of a first time which would be required by said first computing device to generate said second set of speech data values from said first set of speech data values combined with a second time which would be required to format and transmit said second set of speech data values.
  - 61. The method of claim 58, wherein signal processing responsibilities of said first and second computing devices are allocated such that said first computing device performs less than approximately ½ the required signal processing operations needed to convert said speech utterance signals into a form usable by a word recognition engine.
  - 62. The method of claim 58, wherein said speech processing tasks performed by said first and second computing devices are further allocated based on transmission speed capabilities of a transceiver coupled to said first computing device.
  - 63. The method of claim 58, wherein said first processing device is also configured to assist said second processing device with signal processing computations required to generate said second set of speech data values.
  - 64. The method of claim 58, wherein said first set of speech data values represent the least amount of data that can be used by said second processing device to generate said second set of data values usable for a word recognition process.
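
Steps (a) to (c) of claim 58, together with the timing condition in claim 60, amount to a comparison of measured costs. The sketch below expresses that comparison for the single task of deriving the second set of speech data values. The `Measurements` fields and the function name are hypothetical, and how the timings are actually obtained (the initialization routine and latency evaluation) is outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Measurements:
    """Timings in seconds, assumed to come from an initialization routine
    and a channel latency probe."""
    client_derive_s: float   # client time to derive the second set from the first
    server_derive_s: float   # server time to derive the same second set
    extra_transmit_s: float  # added time to format and send the larger second set


def derive_on_server(m: Measurements) -> bool:
    """True when the server can produce the second set in less time than the
    client would need to compute it plus format and transmit it."""
    return m.server_derive_s < m.client_derive_s + m.extra_transmit_s
```

A slow handheld client on a narrow link would typically satisfy the condition, while a fast client on a link with negligible extra transmission cost might not; the decision is made per connection.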

65. A method of performing distributed recognition of a speech utterance comprising the steps of:
- (a) generating a first set of speech data values from speech utterance signals at a first computing system, wherein said first set of speech data values have a limited data content to reduce processing and transmission latencies; and
  
  wherein said first set of speech data values include a set of compressed mel-frequency cepstral coefficients (MFCC);
  
  (b) generating a second set of speech data values derived from said first set of speech data values at a second computing system, said second computing system being independently operable from said first computing system; and
  
  (c) generating a combined speech data value set at said second computing system consisting of said second set of speech data values and said first set of data values;
  
  (d) generating a list of recognized words in said speech utterance, said list being generated at least in part before said speech utterance is finished.
- View Dependent Claims (66, 67, 68, 69, 70)
- - 66. The method of claim 65, further including a step (e) recognizing an articulated sentence containing said list of recognized words.
  - 67. The method of claim 66, wherein said articulated sentence can include one of a number of predefined recognizable sentences, and said articulated sentence is recognized by identifying a candidate set of potential sentences from said number of predefined sentences corresponding to said articulated sentence, and then comparing each entry in the candidate set of potential sentences to said articulated sentence to determine a matching recognized sentence.
  - 68. The method of claim 67, wherein said articulated sentence is processed by a natural language engine operating on said recognized words.
  - 69. The method of claim 68, wherein said articulated sentence is compared against said candidate set of potential sentences by examining noun phrases.
  - 70. The method of claim 67, wherein said candidate set of potential sentences are determined in part by a context dictionary loaded in response to an operating environment presented to a user articulating said sentence.

Current Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Original Assignee
Nuance Communications, Inc. (Microsoft Corporation)
Inventors
Babu, Bandi Ramesh; Gururaj, Pallaki; Morkhandikar, Kishor; Bennett, Ian M.

Granted Patent

US 9,076,448 B2
US Class Current

704/249
CPC Class Codes

G06F 16/24522   Translation of natural lang...

G06F 40/58   Use of machine translation,...

G10L 15/005   Language recognition

G10L 15/142   Hidden Markov Models [HMMs]

G10L 15/18   using natural language mode...

G10L 15/183   using context dependencies,...

G10L 15/22   Procedures used during a sp...

G10L 15/30   Distributed recognition, e....

G10L 17/22   Interactive procedures; Man...

Y10S 707/99935   Query augmenting and refini...
