Context-based language model selection
First Claim
1. A computer-implemented speech-to-text conversion method, comprising:
receiving a voice input provided by a user of an electronic device and contextual metadata that describes a context of the electronic device at a time when the voice input was received, the voice input received by a service running on the electronic device that is capable of providing, from voice or typed input, text output to multiple different applications on the electronic device, and is arranged to select a particular application of the multiple different applications to receive the text output, and the contextual metadata identifying text for a form field displayed to a user and to which the voice input was directed;
identifying a plurality of base language models, wherein each base language model corresponds to a distinct textual corpus of content, and wherein each base language model is trained based on clusters identified in a bipartite cluster graph having clusters that correspond to particular categories of queries entered to a search engine by multiple different client devices, the clusters including search queries and corresponding search results, in the form of web pages, extracted from a historical log that are paired based on the web sites being top results for particular corresponding queries;
selecting a particular base language model, from among the identified plurality of base language models, the selection based at least in part on the text corresponding to the field of the form displayed to the user and to which the voice input was directed; and
using the selected particular base language model to convert the received voice input to a textual output, wherein the service is:
able to (a) receive typed input in a typed mode and voice input in a spoken mode, and adopts the spoken mode based on a user selection before receiving the voice input, and (b) in response to receiving typed or voice input, provide text output to a first application, and
arranged so that a particular instance of the service is external to the multiple different applications and provides text to different ones of the multiple different applications in a manner that speech-to-text conversion by the service is transparent to the different ones of the multiple different applications.
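The selection step in the claim above — choosing a base language model from the text of the form field the voice input is directed at — can be sketched as a simple keyword match. The model names, field labels, and keyword sets below are hypothetical illustrations, not taken from the patent.

```python
# Hypothetical sketch: pick a base language model by matching the text
# label of the targeted form field against per-model keyword sets.
# All names and keywords here are invented for illustration.

BASE_MODELS = {
    "navigation": {"address", "street", "city", "destination"},
    "contacts":   {"name", "recipient", "to", "phone"},
    "web_search": {"search", "query", "find"},
}

def select_base_model(field_label: str, default: str = "web_search") -> str:
    """Return the base model whose keywords best overlap the field label."""
    tokens = set(field_label.lower().split())
    best, best_overlap = default, 0
    for model, keywords in BASE_MODELS.items():
        overlap = len(tokens & keywords)
        if overlap > best_overlap:
            best, best_overlap = model, overlap
    return best

print(select_base_model("destination city"))  # navigation
```

A production system would likely use richer signals than bag-of-words overlap, but the shape of the decision — field text in, model identifier out — is the same.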
Abstract
Methods, computer program products and systems are described for speech-to-text conversion. A voice input is received from a user of an electronic device and contextual metadata is received that describes a context of the electronic device at a time when the voice input is received. Multiple base language models are identified, where each base language model corresponds to a distinct textual corpus of content. Using the contextual metadata, an interpolated language model is generated based on contributions from the base language models. The contributions are weighted according to a weighting for each of the base language models. The interpolated language model is used to convert the received voice input to a textual output. The voice input is received at a computer server system that is remote to the electronic device. The textual output is transmitted to the electronic device.
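The interpolation step the abstract describes — weighted contributions from several base language models combined into one model — can be sketched as a weighted mixture of per-model word probabilities. The unigram probability tables and weights below are invented for illustration; a real system would interpolate full n-gram models.

```python
# Minimal sketch of language-model interpolation: the interpolated
# probability of a word is the weight-normalized sum of each base
# model's probability for that word. Toy unigram tables for illustration.

def interpolate(base_models, weights):
    """Return a function giving P(word) under the weighted mixture."""
    total = sum(weights.values())
    def prob(word):
        return sum(
            weights[name] / total * model.get(word, 0.0)
            for name, model in base_models.items()
        )
    return prob

base_models = {
    "navigation":  {"street": 0.30, "pizza": 0.05},
    "restaurants": {"street": 0.05, "pizza": 0.40},
}

# Contextual metadata (e.g. the user is dictating into a maps app)
# shifts weight toward the navigation model.
p = interpolate(base_models, {"navigation": 0.8, "restaurants": 0.2})
print(round(p("street"), 3))  # 0.25
```

With the weights flipped, "pizza" would dominate instead — which is the point of letting context set the weighting.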
31 Claims
1. A computer-implemented speech-to-text conversion method, comprising:
receiving a voice input provided by a user of an electronic device and contextual metadata that describes a context of the electronic device at a time when the voice input was received, the voice input received by a service running on the electronic device that is capable of providing, from voice or typed input, text output to multiple different applications on the electronic device, and is arranged to select a particular application of the multiple different applications to receive the text output, and the contextual metadata identifying text for a form field displayed to a user and to which the voice input was directed;
identifying a plurality of base language models, wherein each base language model corresponds to a distinct textual corpus of content, and wherein each base language model is trained based on clusters identified in a bipartite cluster graph having clusters that correspond to particular categories of queries entered to a search engine by multiple different client devices, the clusters including search queries and corresponding search results, in the form of web pages, extracted from a historical log that are paired based on the web sites being top results for particular corresponding queries;
selecting a particular base language model, from among the identified plurality of base language models, the selection based at least in part on the text corresponding to the field of the form displayed to the user and to which the voice input was directed; and
using the selected particular base language model to convert the received voice input to a textual output, wherein the service is:
able to (a) receive typed input in a typed mode and voice input in a spoken mode, and adopts the spoken mode based on a user selection before receiving the voice input, and (b) in response to receiving typed or voice input, provide text output to a first application, and
arranged so that a particular instance of the service is external to the multiple different applications and provides text to different ones of the multiple different applications in a manner that speech-to-text conversion by the service is transparent to the different ones of the multiple different applications.
(Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 30, 31)
10. A computer-implemented system for converting speech to text, the system comprising:
one or more computer processors; and
one or more computer-readable devices including instructions that, when executed by the one or more computer processors, implement:
an application of an operating system on distributed electronic devices, the application programmed to obtain, via a single instance of the application, both typed input and voice input, and to generate, for a determined one of multiple different applications executable on a particular electronic device, text from either the typed input or the voice input depending on a user selection to place the particular electronic device in a typed input mode or a voice input mode, wherein voice input is accompanied by contextual metadata that describes a position of a cursor on a display of the particular electronic device at a time when the voice input is obtained;
a plurality of base language models, each base language model corresponding to a particular semantic category, wherein the system is programmed to identify a particular base language model from the plurality of base language models, the identification based at least in part on the position of the cursor on the display of the particular electronic device at the time when the voice input is obtained, wherein each base language model is trained based on clusters identified in a bipartite cluster graph having clusters that correspond to particular categories of queries entered to a search engine by multiple different client devices, the clusters including search queries and corresponding search results, in the form of web pages, extracted from a historical log that are paired based on the web sites being top results for particular corresponding queries.
(Dependent claims: 11, 12, 13, 14, 15, 16)
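Claim 10's contextual metadata is a cursor position rather than field text: the system maps the cursor to the on-screen field under it, whose semantic category names a base language model. The field geometry and category names below are invented for illustration.

```python
# Hypothetical sketch for cursor-position-based model identification:
# hit-test the cursor against on-screen form fields, each tagged with
# the semantic category of a base language model. Geometry and category
# names are illustrative, not from the patent.

from typing import NamedTuple

class Field(NamedTuple):
    x: int
    y: int
    w: int
    h: int
    category: str  # semantic category naming a base language model

FIELDS = [
    Field(0,  0, 200, 40, "email_address"),
    Field(0, 50, 200, 40, "message_body"),
]

def model_for_cursor(cx: int, cy: int, default: str = "general") -> str:
    """Return the category of the field under the cursor, else a default."""
    for f in FIELDS:
        if f.x <= cx < f.x + f.w and f.y <= cy < f.y + f.h:
            return f.category
    return default

print(model_for_cursor(10, 60))  # message_body
```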
17. A non-transitory computer-readable storage device encoded with a computer program product, the computer program product including instructions for speech-to-text conversion that, when executed, cause data processing apparatus to perform operations comprising:
receiving a voice input provided by a user of an electronic device and contextual metadata that describes a context of the electronic device at a time when the voice input is received, the voice input received by a service running on the electronic device that is capable of providing, from voice or typed input, text output to multiple different applications on the electronic device, and is arranged to select a particular application of the multiple different applications to receive the text output, and the contextual metadata identifying text for a form field displayed to a user and to which the voice input was directed;
identifying a plurality of base language models, wherein each base language model corresponds to a distinct textual corpus of content, and wherein each base language model is trained based on clusters identified in a bipartite cluster graph having clusters that correspond to particular categories of queries entered to a search engine by multiple different client devices, the clusters including search queries and corresponding search results, in the form of web pages, extracted from a historical log that are paired based on the web sites being top results for particular corresponding queries;
selecting a particular base language model, from among the identified plurality of base language models, the selection based at least in part on the text corresponding to the field of the form displayed to the user and to which the voice input was directed; and
using the selected particular language model to convert the received voice input to a textual output, wherein the service is:
able to (a) receive typed input in a typed mode and voice input in a spoken mode, and adopts the spoken mode based on a user selection before receiving the voice input, and (b) in response to receiving typed or voice input, provide text output to a first application, and
arranged so that a particular instance of the service is external to the multiple different applications and provides text to different ones of the multiple different applications in a manner that speech-to-text conversion by the service is transparent to the different ones of the multiple different applications.
(Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25, 26)
27. A computer-implemented method, comprising:
extracting, by a computer system, pairs from a historical log of query search results that includes a plurality of search queries and corresponding search results, each pair including a query and a website that corresponds to a search result for the query, wherein each query was previously entered to a search engine by multiple different electronic devices and each website was returned as a search result to a corresponding query;
generating, by the computer system, a bipartite cluster graph based on the extracted pairs of queries and corresponding websites, the bipartite graph having paired queries and websites wherein the queries and websites are paired based on particular websites determined to be top search results for particular queries entered by the multiple different electronic devices;
training, by the computer system, a plurality of language models based on clusters identified in the bipartite cluster graph, wherein the clusters correspond to particular categories of the queries entered to the search engine by the multiple different electronic devices;
based on sample data that comprises one or more sample queries obtained from the multiple different electronic devices that provided input by one or more users into a web form, identifying K clusters from the cluster graph that are most significant to the sample queries, K being an integer;
identifying a correlation between the input by the one or more users into the web form and one or more topics to which particular ones of the language models are directed; and
generating, by the computer system, an interpolated language model for the web form based at least in part on (i) weighted contributions from the language models trained for each of the identified K clusters and (ii) the determined correlation.
(Dependent claims: 28, 29)
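The first three steps of claim 27 — extract (query, website) pairs, build a bipartite graph, and derive clusters that each train one language model — can be sketched as follows. The claim does not name a clustering algorithm, so connected components stand in for it here, and the log data is a toy example.

```python
# Sketch of claim 27's pipeline on toy data: build a bipartite graph of
# queries and their top-result websites, then cluster it. Connected
# components are an assumed stand-in for the unspecified clustering
# method; each resulting query cluster would train one base model.

from collections import defaultdict

log = [  # hypothetical (query, top-result website) pairs
    ("cheap flights", "flights.example"),
    ("airline tickets", "flights.example"),
    ("pizza near me", "pizza.example"),
    ("best pizza", "pizza.example"),
]

# Bipartite adjacency: query nodes on one side, website nodes on the other.
adj = defaultdict(set)
for query, site in log:
    adj[("q", query)].add(("s", site))
    adj[("s", site)].add(("q", query))

def components(adj):
    """Return the connected components of the graph as sets of nodes."""
    seen, clusters = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n])
        clusters.append(comp)
    return clusters

for comp in components(adj):
    # The query side of each cluster is the training corpus seed for
    # one category-specific language model.
    print(sorted(q for kind, q in comp if kind == "q"))
```

The later steps of the claim (picking the K most significant clusters for a web form and weighting their models into an interpolated model) would then operate on these clusters.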
Specification