Context-based language model selection
First Claim
1. A computer-implemented speech-to-text conversion method, comprising:
receiving a voice input provided by a user of an electronic device and contextual metadata that describes a context of the electronic device at a time when the voice input was received, the voice input received by a service running on the electronic device that is capable of providing, from voice or typed input, text output to multiple different applications on the electronic device, and is arranged to select a particular application of the multiple different applications to receive the text output, and the contextual metadata identifying text for a form field displayed to a user and to which the voice input was directed;
identifying a plurality of base language models, wherein each base language model corresponds to a distinct textual corpus of content, and wherein each base language model is trained based on clusters identified in a bipartite cluster graph having clusters that correspond to particular categories of queries entered to a search engine by multiple different client devices, the clusters including search queries and corresponding search results, in the form of web pages, extracted from a historical log that are paired based on the web sites being top results for particular corresponding queries;
selecting a particular base language model, from among the identified plurality of base language models, the selection based at least in part on the text corresponding to the field of the form displayed to the user and to which the voice input was directed; and
using the selected particular base language model to convert the received voice input to a textual output, wherein the service is:
able to (a) receive typed input in a typed mode and voice input in a spoken mode, and adopts the spoken mode based on a user selection before receiving the voice input, and (b) in response to receiving typed or voice input, provide text output to a first application, and
arranged so that a particular instance of the service is external to the multiple different applications and provides text to different ones of the multiple different applications in a manner that speech-to-text conversion by the service is transparent to the different ones of the multiple different applications.
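The selection step in the claim above — choosing a base language model from the text of the form field the voice input is directed at — can be sketched as a simple keyword match. The model names, field labels, and keyword sets below are hypothetical illustrations, not taken from the patent.

```python
# Hypothetical sketch: pick a base language model by matching the text
# label of the targeted form field against per-model keyword sets.
# All names and keywords here are invented for illustration.

BASE_MODELS = {
    "navigation": {"address", "street", "city", "destination"},
    "contacts":   {"name", "recipient", "to", "phone"},
    "web_search": {"search", "query", "find"},
}

def select_base_model(field_label: str, default: str = "web_search") -> str:
    """Return the base model whose keywords best overlap the field label."""
    tokens = set(field_label.lower().split())
    best, best_overlap = default, 0
    for model, keywords in BASE_MODELS.items():
        overlap = len(tokens & keywords)
        if overlap > best_overlap:
            best, best_overlap = model, overlap
    return best

print(select_base_model("destination city"))  # navigation
```

A production system would likely use richer signals than bag-of-words overlap, but the shape of the decision — field text in, model identifier out — is the same.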
Abstract
Methods, computer program products and systems are described for speech-to-text conversion. A voice input is received from a user of an electronic device and contextual metadata is received that describes a context of the electronic device at a time when the voice input is received. Multiple base language models are identified, where each base language model corresponds to a distinct textual corpus of content. Using the contextual metadata, an interpolated language model is generated based on contributions from the base language models. The contributions are weighted according to a weighting for each of the base language models. The interpolated language model is used to convert the received voice input to a textual output. The voice input is received at a computer server system that is remote to the electronic device. The textual output is transmitted to the electronic device.
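The interpolation step the abstract describes — weighted contributions from several base language models combined into one model — can be sketched as a weighted mixture of per-model word probabilities. The unigram probability tables and weights below are invented for illustration; a real system would interpolate full n-gram models.

```python
# Minimal sketch of language-model interpolation: the interpolated
# probability of a word is the weight-normalized sum of each base
# model's probability for that word. Toy unigram tables for illustration.

def interpolate(base_models, weights):
    """Return a function giving P(word) under the weighted mixture."""
    total = sum(weights.values())
    def prob(word):
        return sum(
            weights[name] / total * model.get(word, 0.0)
            for name, model in base_models.items()
        )
    return prob

base_models = {
    "navigation":  {"street": 0.30, "pizza": 0.05},
    "restaurants": {"street": 0.05, "pizza": 0.40},
}

# Contextual metadata (e.g. the user is dictating into a maps app)
# shifts weight toward the navigation model.
p = interpolate(base_models, {"navigation": 0.8, "restaurants": 0.2})
print(round(p("street"), 3))  # 0.25
```

With the weights flipped, "pizza" would dominate instead — which is the point of letting context set the weighting.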
31 Claims
1. A computer-implemented speech-to-text conversion method, comprising:
receiving a voice input provided by a user of an electronic device and contextual metadata that describes a context of the electronic device at a time when the voice input was received, the voice input received by a service running on the electronic device that is capable of providing, from voice or typed input, text output to multiple different applications on the electronic device, and is arranged to select a particular application of the multiple different applications to receive the text output, and the contextual metadata identifying text for a form field displayed to a user and to which the voice input was directed;
identifying a plurality of base language models, wherein each base language model corresponds to a distinct textual corpus of content, and wherein each base language model is trained based on clusters identified in a bipartite cluster graph having clusters that correspond to particular categories of queries entered to a search engine by multiple different client devices, the clusters including search queries and corresponding search results, in the form of web pages, extracted from a historical log that are paired based on the web sites being top results for particular corresponding queries;
selecting a particular base language model, from among the identified plurality of base language models, the selection based at least in part on the text corresponding to the field of the form displayed to the user and to which the voice input was directed; and
using the selected particular base language model to convert the received voice input to a textual output, wherein the service is:
able to (a) receive typed input in a typed mode and voice input in a spoken mode, and adopts the spoken mode based on a user selection before receiving the voice input, and (b) in response to receiving typed or voice input, provide text output to a first application, and
arranged so that a particular instance of the service is external to the multiple different applications and provides text to different ones of the multiple different applications in a manner that speech-to-text conversion by the service is transparent to the different ones of the multiple different applications.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 30, 31)
10. A computer-implemented system for converting speech to text, the system comprising:
one or more computer processors; and
one or more computer-readable devices including instructions that, when executed by the one or more computer processors, implement:
an application of an operating system on distributed electronic devices, the application programmed to obtain, via a single instance of the application, both typed input and voice input, and to generate, for a determined one of multiple different applications executable on a particular electronic device, text from either the typed input or the voice input depending on a user selection to place the particular electronic device in a typed input mode or a voice input mode, wherein voice input is accompanied by contextual metadata that describes a position of a cursor on a display of the particular electronic device at a time when the voice input is obtained;
a plurality of base language models, each base language model corresponding to a particular semantic category, wherein the system is programmed to identify a particular base language model from the plurality of base language models, the identification based at least in part on the position of the cursor on the display of the particular electronic device at the time when the voice input is obtained, wherein each base language model is trained based on clusters identified in a bipartite cluster graph having clusters that correspond to particular categories of queries entered to a search engine by multiple different client devices, the clusters including search queries and corresponding search results, in the form of web pages, extracted from a historical log that are paired based on the web sites being top results for particular corresponding queries.
(Dependent claims: 11, 12, 13, 14, 15, 16)
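Claim 10's contextual metadata is a cursor position rather than field text: the system maps the cursor to the on-screen field under it, whose semantic category names a base language model. The field geometry and category names below are invented for illustration.

```python
# Hypothetical sketch for cursor-position-based model identification:
# hit-test the cursor against on-screen form fields, each tagged with
# the semantic category of a base language model. Geometry and category
# names are illustrative, not from the patent.

from typing import NamedTuple

class Field(NamedTuple):
    x: int
    y: int
    w: int
    h: int
    category: str  # semantic category naming a base language model

FIELDS = [
    Field(0,  0, 200, 40, "email_address"),
    Field(0, 50, 200, 40, "message_body"),
]

def model_for_cursor(cx: int, cy: int, default: str = "general") -> str:
    """Return the category of the field under the cursor, else a default."""
    for f in FIELDS:
        if f.x <= cx < f.x + f.w and f.y <= cy < f.y + f.h:
            return f.category
    return default

print(model_for_cursor(10, 60))  # message_body
```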
17. A non-transitory computer-readable storage device encoded with a computer program product, the computer program product including instructions for speech-to-text conversion that, when executed, cause data processing apparatus to perform operations comprising:
receiving a voice input provided by a user of an electronic device and contextual metadata that describes a context of the electronic device at a time when the voice input is received, the voice input received by a service running on the electronic device that is capable of providing, from voice or typed input, text output to multiple different applications on the electronic device, and is arranged to select a particular application of the multiple different applications to receive the text output, and the contextual metadata identifying text for a form field displayed to a user and to which the voice input was directed;
identifying a plurality of base language models, wherein each base language model corresponds to a distinct textual corpus of content, and wherein each base language model is trained based on clusters identified in a bipartite cluster graph having clusters that correspond to particular categories of queries entered to a search engine by multiple different client devices, the clusters including search queries and corresponding search results, in the form of web pages, extracted from a historical log that are paired based on the web sites being top results for particular corresponding queries;
selecting a particular base language model, from among the identified plurality of base language models, the selection based at least in part on the text corresponding to the field of the form displayed to the user and to which the voice input was directed; and
using the selected particular language model to convert the received voice input to a textual output, wherein the service is:
able to (a) receive typed input in a typed mode and voice input in a spoken mode, and adopts the spoken mode based on a user selection before receiving the voice input, and (b) in response to receiving typed or voice input, provide text output to a first application, and
arranged so that a particular instance of the service is external to the multiple different applications and provides text to different ones of the multiple different applications in a manner that speech-to-text conversion by the service is transparent to the different ones of the multiple different applications.
(Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25, 26)
27. A computer-implemented method, comprising:
extracting, by a computer system, pairs from a historical log of query search results that includes a plurality of search queries and corresponding search results, each pair including a query and a website that corresponds to a search result for the query, wherein each query was previously entered to a search engine by multiple different electronic devices and each website was returned as a search result to a corresponding query;
generating, by the computer system, a bipartite cluster graph based on the extracted pairs of queries and corresponding websites, the bipartite graph having paired queries and websites wherein the queries and websites are paired based on particular websites determined to be top search results for particular queries entered by the multiple different electronic devices;
training, by the computer system, a plurality of language models based on clusters identified in the bipartite cluster graph, wherein the clusters correspond to particular categories of the queries entered to the search engine by the multiple different electronic devices;
based on sample data that comprises one or more sample queries obtained from the multiple different electronic devices that provided input by one or more users into a web form, identifying K clusters from the cluster graph that are most significant to the sample queries, K being an integer;
identifying a correlation between the input by the one or more users into the web form and one or more topics to which particular ones of the language models are directed; and
generating, by the computer system, an interpolated language model for the web form based at least in part on (i) weighted contributions from the language models trained for each of the identified K clusters and (ii) the determined correlation.
(Dependent claims: 28, 29)
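The first three steps of claim 27 — extract (query, website) pairs, build a bipartite graph, and derive clusters that each train one language model — can be sketched as follows. The claim does not name a clustering algorithm, so connected components stand in for it here, and the log data is a toy example.

```python
# Sketch of claim 27's pipeline on toy data: build a bipartite graph of
# queries and their top-result websites, then cluster it. Connected
# components are an assumed stand-in for the unspecified clustering
# method; each resulting query cluster would train one base model.

from collections import defaultdict

log = [  # hypothetical (query, top-result website) pairs
    ("cheap flights", "flights.example"),
    ("airline tickets", "flights.example"),
    ("pizza near me", "pizza.example"),
    ("best pizza", "pizza.example"),
]

# Bipartite adjacency: query nodes on one side, website nodes on the other.
adj = defaultdict(set)
for query, site in log:
    adj[("q", query)].add(("s", site))
    adj[("s", site)].add(("q", query))

def components(adj):
    """Return the connected components of the graph as sets of nodes."""
    seen, clusters = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n])
        clusters.append(comp)
    return clusters

for comp in components(adj):
    # The query side of each cluster is the training corpus seed for
    # one category-specific language model.
    print(sorted(q for kind, q in comp if kind == "q"))
```

The later steps of the claim (picking the K most significant clusters for a web form and weighting their models into an interpolated model) would then operate on these clusters.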
Specification