Speech to Text Conversion

US 20110161080A1
Filed: 12/22/2010
Published: 06/30/2011
Est. Priority Date: 12/23/2009
Status: Abandoned Application

Abstract

Methods, computer program products and systems are described for speech-to-text conversion. A voice input is received from a user of an electronic device and contextual metadata is received that describes a context of the electronic device at a time when the voice input is received. Multiple base language models are identified, where each base language model corresponds to a distinct textual corpus of content. Using the contextual metadata, an interpolated language model is generated based on contributions from the base language models. The contributions are weighted according to a weighting for each of the base language models. The interpolated language model is used to convert the received voice input to a textual output. The voice input is received at a computer server system that is remote to the electronic device. The textual output is transmitted to the electronic device.
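
The abstract does not say how the base models are combined. As a minimal sketch, assuming plain linear interpolation over unigram models, the flow might look like the following. The corpora, the field names, the `WEIGHTS_BY_FIELD` table, and every function name are hypothetical and chosen only for illustration.

```python
from collections import Counter

def train_unigram(corpus):
    """Maximum-likelihood unigram model: word -> probability."""
    counts = Counter(word for sentence in corpus for word in sentence.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical base models, one per distinct textual corpus.
BASE_MODELS = {
    "maps": train_unigram(["pizza near main street", "route to main street"]),
    "mail": train_unigram(["meeting at noon", "see you at the meeting"]),
}

# Hypothetical weightings keyed by the focused input field
# (standing in for the contextual metadata).
WEIGHTS_BY_FIELD = {
    "address": {"maps": 0.9, "mail": 0.1},
    "message_body": {"maps": 0.2, "mail": 0.8},
}

def interpolate(field):
    """Mix the base models: P(w) = sum_i weight_i * P_i(w)."""
    weights = WEIGHTS_BY_FIELD[field]
    vocab = set().union(*(BASE_MODELS[name] for name in weights))
    return {
        word: sum(weights[name] * BASE_MODELS[name].get(word, 0.0) for name in weights)
        for word in vocab
    }

def score(model, text, floor=1e-6):
    """Product of word probabilities; unseen words get a small floor value."""
    probability = 1.0
    for word in text.split():
        probability *= model.get(word, floor)
    return probability

def pick_transcript(candidates, field):
    """Choose the candidate transcript the interpolated model rates highest."""
    model = interpolate(field)
    return max(candidates, key=lambda text: score(model, text))

print(pick_transcript(["main street", "meeting at"], "address"))
print(pick_transcript(["main street", "meeting at"], "message_body"))
```

A real recognizer would combine such a language-model score with an acoustic score and would use higher-order n-grams; the sketch only shows how the context-dependent weights shift which candidate wins.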

29 Claims

1. A computer-implemented speech-to-text conversion method, comprising:
- receiving a voice input from a user of an electronic device and contextual metadata that describes a context of the electronic device at a time when the voice input is received;
  
  identifying a plurality of base language models, wherein each base language model corresponds to a distinct textual corpus of content;
  
  using the contextual metadata to generate an interpolated language model based on contributions from the plurality of base language models, wherein the contributions are weighted according to a weighting for each of the base language models; and
  
  using the interpolated language model to convert the received voice input to a textual output.
  - 2. The method of claim 1, wherein the voice input is received at a computer server system that is remote to the electronic device, the method further comprising:
    - transmitting the textual output to the electronic device.
  - 3. The method of claim 1, wherein the contextual metadata identifies a field in an electronic document on which the electronic device was focused when the voice input was received at the electronic device.
  - 4. The method of claim 1, further comprising:
    - determining the weightings for each of the base language models based on the contextual metadata.
  - 5. The method of claim 1, further comprising:
    - building one or more of the base language models based on text input data collected from a plurality of users and metadata that corresponds to the text input data.
  - 6. The method of claim 5, wherein for a particular text input data the corresponding metadata identifies an input field that corresponds to the text input data.
  - 7. The method of claim 5, wherein the text input data and the metadata are formed as individual pairs, the method further comprising:
    - forming a bipartite cluster graph of the individual pairs.
  - 8. The method of claim 7, further comprising identifying clusters in the graph and using the clusters to generate the interpolated language model.
  - 9. The method of claim 8, further comprising training the base language models by using sample voice utterances from a plurality of users of a plurality of electronic devices.
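
Claims 5 through 8 describe pairing collected text input with its metadata, forming a bipartite graph of the pairs, and identifying clusters. The claims do not name a clustering algorithm, so the sketch below uses connected components purely as a stand-in. The sample pairs, the node tagging scheme, and the function names are assumptions.

```python
from collections import defaultdict

def build_bipartite_graph(pairs):
    """Link each text node to the field node it was entered into."""
    graph = defaultdict(set)
    for text, field in pairs:
        # Tag nodes by side so a text and a field with the same string stay distinct.
        graph[("text", text)].add(("field", field))
        graph[("field", field)].add(("text", text))
    return dict(graph)

def find_clusters(graph):
    """Group nodes into connected components (a stand-in for real clustering)."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        clusters.append(component)
    return clusters

# Hypothetical (text input, input field) pairs collected from users.
pairs = [
    ("123 main st", "address"),
    ("456 oak ave", "address"),
    ("123 main st", "shipping_address"),
    ("see you soon", "message_body"),
]
for cluster in find_clusters(build_bipartite_graph(pairs)):
    print(sorted(cluster))
```

Here the two address-like fields fall into one cluster because they share a text entry, which is the kind of grouping a per-cluster base language model could then be trained on.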

10. A computer-implemented system for converting speech to text, the system comprising:
- a plurality of base language models, each base language model corresponding to a particular semantic category;
  
  an interpolated language model that is linked to the plurality of base language models; and
  
  wherein each link between the interpolated language model and each of the base language models is associated with a weight.
  - 11. The system of claim 10, wherein the weight for each link between the interpolated language model and a base language model is based on an accuracy of the base language model in associating a voice input with a text output representing a conversion of the voice input into text.
  - 12. The system of claim 10, wherein the weights represent likelihoods of usage in the interpolated language model matching usage in the particular base language model.
  - 13. The system of claim 12, wherein the weightings are a function of the semantic category.
  - 14. The system of claim 13, further comprising:
    - a network interface configured to:
      
      receive a voice input; and
      
      cause the interpolated language model to be applied to the voice input to generate a text output.
  - 15. The system of claim 14, wherein the network interface is further configured to:
    - use metadata received with the voice input to match to the semantic category to determine weightings for the base language models from the interpolated language model.
  - 16. The system of claim 15, wherein the system is configured to dynamically apply the weightings for the plurality of base language models in real-time substantially as the voice input is received by the network interface.
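
Claim 11 ties each link weight to the accuracy of the linked base model, and claims 13 and 15 make the weightings depend on a semantic category matched from metadata. One possible reading, sketched below, normalizes per-category accuracies so the weights sum to one. The normalization rule, the accuracy figures, and all names are assumptions rather than anything stated in the claims.

```python
def weights_from_accuracy(accuracies):
    """Turn per-model accuracies into link weights that sum to 1."""
    total = sum(accuracies.values())
    if total == 0:
        # No model is ever right: fall back to uniform weights.
        return {name: 1.0 / len(accuracies) for name in accuracies}
    return {name: acc / total for name, acc in accuracies.items()}

# Hypothetical held-out accuracy of each base model, per semantic category.
ACCURACY_BY_CATEGORY = {
    "navigation": {"maps": 0.6, "mail": 0.2},
    "messaging": {"maps": 0.1, "mail": 0.7},
}

def weights_for(category):
    """Look up the category matched from the request metadata and derive weights."""
    return weights_from_accuracy(ACCURACY_BY_CATEGORY[category])

print(weights_for("navigation"))
print(weights_for("messaging"))
```

Because the lookup is a small table read followed by a division, it could be applied per request as the voice input arrives, in the spirit of claim 16.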

17. A computer-readable storage device encoded with a computer program product, the computer program product including instructions for speech-to-text conversion that, when executed, cause data processing apparatus to perform operations comprising:
- receiving a voice input from a user of an electronic device and contextual metadata that describes a context of the electronic device at a time when the voice input is received;
  
  identifying a plurality of base language models, wherein each base language model corresponds to a distinct textual corpus of content;
  
  using the contextual metadata to generate an interpolated language model based on contributions from the plurality of base language models, wherein the contributions are weighted according to a weighting for each of the base language models; and
  
  using the interpolated language model to convert the received voice input to a textual output.
  - 18. The computer-readable storage device of claim 17, wherein the voice input is received at a computer server system that is remote to the electronic device, the operations further comprising:
    - transmitting the textual output to the electronic device.
  - 19. The computer-readable storage device of claim 17, wherein the contextual metadata identifies a field in an electronic document on which the electronic device was focused when the voice input was received at the electronic device.
  - 20. The computer-readable storage device of claim 17, the operations further comprising:
    - determining weightings for each of the base language models based on the contextual metadata.
  - 21. The computer-readable storage device of claim 20, wherein the interpolated language model is based on contributions from each of the base language models that are proportionate to the respective weightings for each of the base language models.
  - 22. The computer-readable storage device of claim 17, the operations further comprising:
    - building one or more of the base language models based on text input data collected from a plurality of users and metadata that corresponds to the text input data.
  - 23. The computer-readable storage device of claim 22, wherein for a particular text input data the corresponding metadata identifies an input field that corresponds to the text input data.
  - 24. The computer-readable storage device of claim 22, wherein the text input data and the metadata are formed as individual pairs, the operations further comprising:
    - forming a bipartite cluster graph of the individual pairs.
  - 25. The computer-readable storage device of claim 24, the operations further comprising:
    - identifying clusters in the graph and using the clusters to generate the interpolated language model.
  - 26. The computer-readable storage device of claim 25, the operations further comprising:
    - training the base language models by using sample voice utterances from a plurality of users of a plurality of electronic devices.

27. A computer-implemented method, comprising:
- extracting pairs from a historical log of query search results that includes a plurality of search queries and corresponding search results, each pair including a query and a website that corresponds to a search result for the query;
  
  generating a bipartite cluster graph based on the extracted pairs of queries and corresponding websites;
  
  training a plurality of language models based on clusters identified in the bipartite cluster graph;
  
  based on sample data obtained from input by one or more users into a web form, the sample data comprising one or more sample queries, identifying K clusters from the cluster graph that are most significant to the sample queries, K being an integer; and
  
  generating an interpolated language model for the web form based on weighted contributions from the language models trained for each of the identified K clusters.
  - 28. The method of claim 27, further comprising:
    - receiving a voice input into the web form; and
      
      based on the interpolated language model generating a text output that represents the voice input.
  - 29. The method of claim 28, wherein the voice input is received from a user of an electronic device and the text output is generated by a computer server system that is remote to the electronic device, the method further comprising:
    - transmitting the text output to the electronic device.
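
The last two steps of claim 27 (picking the K most significant clusters for a web form and mixing their language models) can be sketched as follows. The claim does not define "most significant", so the sketch assumes a simple measure: the average probability a cluster's unigram model assigns to the words of the sample queries. The cluster contents, the measure, and all names are hypothetical.

```python
from collections import Counter

def train_unigram(texts):
    """Maximum-likelihood unigram model: word -> probability."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def significance(model, sample_queries, floor=1e-6):
    """Assumed measure: mean probability of the sample words under the model."""
    words = [word for query in sample_queries for word in query.split()]
    return sum(model.get(word, floor) for word in words) / len(words)

def top_k_interpolated(cluster_queries, sample_queries, k):
    """Pick the K best-matching clusters and mix their models by normalized score."""
    models = {cid: train_unigram(qs) for cid, qs in cluster_queries.items()}
    scores = {cid: significance(m, sample_queries) for cid, m in models.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    total = sum(scores[cid] for cid in top)
    weights = {cid: scores[cid] / total for cid in top}
    vocab = set().union(*(models[cid] for cid in top))
    mixed = {
        word: sum(weights[cid] * models[cid].get(word, 0.0) for cid in top)
        for word in vocab
    }
    return weights, mixed

# Hypothetical clusters of queries, e.g. taken from the bipartite cluster graph.
clusters = {
    "travel": ["cheap flights to paris", "paris hotels"],
    "food": ["pizza recipe", "best pizza dough"],
    "sports": ["football scores", "basketball scores"],
}
weights, mixed = top_k_interpolated(clusters, ["flights to rome", "paris hotels"], k=2)
print(weights)
```

With sample queries drawn from a travel-related form, nearly all of the weight lands on the travel cluster, so the resulting interpolated model favors travel vocabulary for voice input into that form.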

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Riley, Michael D.; Schalkwyk, Johan; Cohen, Michael H.; Ballinger, Brandon M.; Allauzen, Cyril Georges Luc
Application Number: US12/976,972
Publication Number: US 20110161080A1
US Class Current: 704/235

CPC Class Codes

G06F 3/04886   by partitioning the display...
G06F 3/167   Audio in a user interface, ...
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/58   Use of machine translation,...
G10L 15/005   Language recognition
G10L 15/18   using natural language mode...
G10L 15/183   using context dependencies,...
G10L 15/197   Probabilistic grammars, e.g...
G10L 15/22   Procedures used during a sp...
G10L 15/26   Speech to text systems G10L...
G10L 15/30   Distributed recognition, e....
G10L 2015/223   Execution procedure of a sp...
G10L 2015/228   of application context
