Named entity variations for multimodal understanding systems
Abstract
Click logs are automatically mined to assist in discovering candidate variations for named entities. The named entities may be obtained from one or more sources and include an initial list of named entities. A search may be performed within one or more search engines to determine common phrases that are used to identify a named entity in addition to the name initially included in the named entity list. Click logs associated with results of past searches are automatically mined to discover which of the phrases determined from the searches are candidate variations for the named entity. The candidate variations are scored to assist in determining the variations to include within an understanding model. The variations may also be used when delivering responses and displayed output in the spoken language understanding (SLU) system. For example, instead of using the listed named entity, a popular and/or shortened name may be used by the system.
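The mining step described in the abstract can be pictured with a small sketch. It assumes, purely for illustration, a click log of (query, clicked URL, click count) rows, and it treats any query whose clicks land on URLs also clicked for the canonical phrase as a candidate variation. The row layout, the `candidate_variations` helper, the `min_shared_clicks` cutoff, and the sample data are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

# Hypothetical click log rows: (query, clicked_url, click_count).
CLICK_LOG = [
    ("the dark knight rises", "movies.example/dark-knight-rises", 120),
    ("dark knight rises", "movies.example/dark-knight-rises", 90),
    ("tdkr", "movies.example/dark-knight-rises", 15),
    ("tdkr", "acronyms.example/tdkr", 5),
    ("weather seattle", "weather.example/seattle", 200),
]


def candidate_variations(canonical, click_log, min_shared_clicks=10):
    """Return queries whose clicks land on URLs also clicked for `canonical`.

    Assumption: sharing clicked URLs with the canonical phrase is a usable
    signal that a query is an alternative phrase for the same entity.
    """
    urls_for_canonical = {url for query, url, _ in click_log if query == canonical}
    shared = defaultdict(int)
    for query, url, count in click_log:
        if query != canonical and url in urls_for_canonical:
            shared[query] += count
    # Keep only queries with enough shared clicks; the cutoff is arbitrary.
    return sorted(q for q, clicks in shared.items() if clicks >= min_shared_clicks)


candidates = candidate_variations("the dark knight rises", CLICK_LOG)
```

Candidates found this way are only proposals; the claims score each one before it is added to the named entity list.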
20 Claims
1. A method for determining variations for named entities, comprising:
accessing a named entity list comprising a canonical phrase for an entry in the named entity list;
determining a candidate variation for the canonical phrase, wherein the candidate variation is an alternative phrase for the canonical phrase;
determining, using click log data associated with a seed entity list of entities that are of a same type as the canonical phrase, a set of related websites from a plurality of websites, wherein the set of related websites is determined by mining the click log data to identify links that are most selected for queries corresponding with the seed entity list, wherein the mining of the click data includes determining a ratio of clicks received for a particular website to clicks received for queries in the seed entity list;
obtaining a likelihood ratio for a website in the set of related websites, wherein the likelihood ratio indicates a probability of a click on the particular website in response to a query from the seed entity list versus a probability of a click on the particular website in response to a random query;
generating a score for the candidate variation based on evaluation of a distribution of click data for the candidate variation, wherein the score for the candidate variation is a weighted click vote over the websites clicked for the candidate variation and is generated as a sum over all websites clicked in response to a query comprising the candidate variation, where each website is weighted by the likelihood ratio pertaining to the particular website;
determining whether to include the candidate variation in a language understanding model based on the score; and
training the language understanding model by updating the named entity list to include the candidate variation for the canonical phrase based on the score.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
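The scoring steps recited in claim 1 can be sketched as follows. This is one possible reading under stated assumptions, not the patented implementation: the probability of a click "in response to a random query" is approximated by click shares over the whole log, each clicked website contributes its share of the variation's clicks multiplied by its likelihood ratio, and websites outside the related set contribute nothing. The function names, the `top_k` cutoff, and the sample log are hypothetical.

```python
from collections import Counter

# Hypothetical click log rows: (query, clicked_site, click_count).
CLICK_LOG = [
    ("inception", "movies.example", 80),
    ("inception", "encyclopedia.example", 20),
    ("the matrix", "movies.example", 90),
    ("the matrix", "encyclopedia.example", 10),
    ("weather today", "weather.example", 150),
    ("weather today", "encyclopedia.example", 50),
    ("tdkr", "movies.example", 30),
    ("tdkr", "weather.example", 10),
]
SEEDS = {"inception", "the matrix"}  # seed entities of the same type (movies)


def related_websites(click_log, seed_queries, top_k=2):
    """Most-clicked sites for seed queries, with each site's share of seed clicks."""
    seed_clicks = Counter()
    for query, site, count in click_log:
        if query in seed_queries:
            seed_clicks[site] += count
    total = sum(seed_clicks.values())
    ranked = sorted(seed_clicks, key=lambda s: (-seed_clicks[s], s))
    return {site: seed_clicks[site] / total for site in ranked[:top_k]}


def likelihood_ratio(site, click_log, seed_queries):
    """P(click on site | seed query) / P(click on site | any query in the log)."""
    seed_site = seed_all = any_site = any_all = 0
    for query, clicked, count in click_log:
        any_all += count
        any_site += count if clicked == site else 0
        if query in seed_queries:
            seed_all += count
            seed_site += count if clicked == site else 0
    if seed_all == 0 or any_site == 0:
        return 0.0  # no evidence either way
    return (seed_site / seed_all) / (any_site / any_all)


def score_variation(variation, click_log, seed_queries, related):
    """Weighted click vote: sum of click share x likelihood ratio over clicked sites."""
    clicks = Counter()
    for query, site, count in click_log:
        if query == variation:
            clicks[site] += count
    total = sum(clicks.values())
    if total == 0:
        return 0.0
    # Assumption: sites outside the related set carry zero weight.
    return sum(
        (n / total) * likelihood_ratio(site, click_log, seed_queries)
        for site, n in clicks.items()
        if site in related
    )


related = related_websites(CLICK_LOG, SEEDS)
tdkr_score = score_variation("tdkr", CLICK_LOG, SEEDS, related)
weather_score = score_variation("weather today", CLICK_LOG, SEEDS, related)
```

In this toy log, "tdkr" sends most of its clicks to a site that seed movie queries favor, so it scores higher than "weather today", whose clicks mostly go elsewhere.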
10. A computer-readable storage device storing computer-executable instructions that, when executed by a processor, cause the processor to execute a method comprising:
accessing a named entity list comprising a canonical phrase for an entry in the named entity list;
determining one or more candidate variations for the canonical phrase, wherein a candidate variation is an alternative phrase for the canonical phrase;
determining, using click log data associated with a seed entity list of entities that are of a same type as the canonical phrase, a set of related websites from a plurality of websites, wherein the set of related websites is determined by mining the click log data to identify links that are most selected for queries corresponding with the seed entity list, wherein the mining of the click data includes determining a ratio of clicks received for a particular website to clicks received for queries in the seed entity list;
obtaining a likelihood ratio for each website in the set of related websites, wherein the likelihood ratio indicates a probability of a click on the particular website in response to a query from the seed entity list versus a probability of a click on the particular website in response to a random query;
generating a score for the one or more candidate variations based on evaluation of a distribution of click data for the one or more candidate variations, wherein the score for each of the one or more candidate variations is a weighted click vote over the websites clicked for the one or more candidate variations and is generated as a sum over all websites clicked in response to a query comprising the one or more candidate variations, where each website is weighted by the likelihood ratio pertaining to the particular website;
updating the named entity list to include one or more of the candidate variations for the canonical phrase within the named entity list based on the score; and
training a language understanding model by updating the named entity list to include the candidate variation for the canonical phrase based on the score.
Dependent claims: 11, 12, 13, 14, 15, 16
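The list-update step in claim 10 might look like the sketch below. The claim says only that the update is "based on the score"; the fixed `threshold`, the dictionary layout of the named entity list, and the `update_entity_list` helper are assumptions made for illustration.

```python
def update_entity_list(entity_list, canonical, scored_candidates, threshold=1.0):
    """Add candidates scoring at or above `threshold` as variations of `canonical`.

    `entity_list` maps a canonical phrase to its list of accepted variations
    (hypothetical layout). Higher-scoring candidates are added first.
    """
    accepted = entity_list.setdefault(canonical, [])
    ranked = sorted(scored_candidates.items(), key=lambda item: -item[1])
    for phrase, score in ranked:
        if score >= threshold and phrase not in accepted:
            accepted.append(phrase)
    return entity_list


# Hypothetical scores, e.g. produced by a weighted click vote.
entities = {"the dark knight rises": []}
scores = {"tdkr": 1.4, "dark knight rises": 1.8, "weather today": 0.2}
update_entity_list(entities, "the dark knight rises", scores)
```

A fixed cutoff is the simplest choice; a deployed system could just as well keep the top N candidates per entity or tune the cutoff on held-out data.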
17. A system for determining variations for named entities, comprising:
a processor and memory;
an operating environment executing using the processor; and
a display, wherein the processor is configured to perform actions comprising:
accessing a named entity list comprising a canonical phrase for at least one entry in the named entity list;
determining one or more candidate variations for the canonical phrase, wherein a candidate variation is an alternative phrase for the canonical phrase;
determining, using click log data associated with a seed entity list of entities that are of a same type as the canonical phrase, a set of related websites from a plurality of websites, wherein the set of related websites is determined by mining the click log data to identify links that are most selected for queries corresponding with the seed entity list, wherein the mining of the click data includes determining a ratio of clicks received for a particular website to clicks received for queries in the seed entity list;
obtaining a likelihood ratio for a website in the set of related websites, wherein the likelihood ratio indicates a probability of a click on the particular website in response to a query from the seed entity list versus a probability of a click on the particular website in response to a random query;
generating a score for the one or more candidate variations based on evaluation of a distribution of click data for the candidate variation, wherein the score for the one or more candidate variations is a weighted click vote over the websites clicked for the one or more candidate variations and is generated as a sum over all websites clicked in response to a query comprising the one or more candidate variations, where each website is weighted by the likelihood ratio pertaining to the particular website;
determining at least one candidate variation to include in a language understanding model using the score for the one or more candidate variations;
training the language understanding model to include one or more of the candidate variations within the named entity list based on the determined candidate variations; and
providing the trained language model as a resource in a distributed service.
Dependent claims: 18, 19, 20
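To show how an updated named entity list could be consulted at understanding time, the sketch below resolves a phrase in a user utterance back to its canonical entry. The claims do not say the language understanding model works this way; the plain dictionary lookup, the longest-match rule, and both helper names are assumptions chosen to keep the example small.

```python
def build_lookup(entity_list):
    """Map every known phrase (canonical or variation) to its canonical phrase."""
    lookup = {}
    for canonical, variations in entity_list.items():
        for phrase in [canonical, *variations]:
            lookup[phrase.lower()] = canonical
    return lookup


def find_entity(utterance, lookup):
    """Return the canonical phrase for the longest known phrase in `utterance`."""
    tokens = utterance.lower().split()
    # Try longer spans first so "dark knight rises" wins over a shorter match.
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            span = " ".join(tokens[start:start + length])
            if span in lookup:
                return lookup[span]
    return None


# Hypothetical entity list after variations have been added.
entities = {"the dark knight rises": ["dark knight rises", "tdkr"]}
lookup = build_lookup(entities)
match = find_entity("play tdkr tonight", lookup)
```

The same mapping can be read in the other direction when generating output, for example to display a popular short name in place of the listed canonical name, as the abstract suggests.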
Specification