Named entity variations for multimodal understanding systems

US 9,916,301 B2
Filed: 12/21/2012
Issued: 03/13/2018
Est. Priority Date: 12/21/2012
Status: Active Grant


Abstract

Click logs are automatically mined to assist in discovering candidate variations for named entities. The named entities may be obtained from one or more sources and include an initial list of named entities. A search may be performed within one or more search engines to determine common phrases that are used to identify the named entity in addition to the named entity initially included in the named entity list. Click logs associated with results of past searches are automatically mined to discover what phrases determined from the searches are candidate variations for the named entity. The candidate variations are scored to assist in determining the variations to include within an understanding model. The variations may also be used when delivering responses and displayed output in the SLU system. For example, instead of using the listed named entity, a popular and/or shortened name may be used by the system.

20 Claims

1. A method for determining variations for named entities, comprising:
- accessing a named entity list comprising a canonical phrase for an entry in the named entity list;
  
  determining a candidate variation for the canonical phrase, wherein the candidate variation is an alternative phrase for the canonical phrase;
  
  determining, using click log data associated with a seed entity list of entities that are of a same type as the canonical phrase, a set of related websites from a plurality of websites, wherein the set of related websites is determined by mining the click log data to identify links that are most selected for queries corresponding with the seed entity list, wherein the mining of the click data includes determining a ratio of clicks received for a particular website to clicks received for queries in the seed entity list;
  
  obtaining a likelihood ratio for a website in the set of related websites, wherein the likelihood ratio indicates a probability of a click on the particular website in response to a query from the seed entity list versus a probability of click on the particular website in response to a random query;
  
  generating a score for the candidate variation based on evaluation of a distribution of click data for the candidate variation, wherein the score for the candidate variation is a weighted click vote over the web sites clicked for the candidate variation and is generated as a sum over all websites clicked in response to a query comprising the candidate variation, where each website is weighted by the likelihood ratio pertaining to the particular website;
  
  determining whether to include the candidate variation in a language understanding model based on the score; and
  
  training the language understanding model by updating the named entity list to include the candidate variation for the canonical phrase based on the score.
  - 2. The method of claim 1, wherein determining of the candidate variation comprises performing a search of the canonical phrase using a search engine.
  - 3. The method of claim 1, further comprising:
    - using the updated named entity list to process a received query.
  - 4. The method of claim 3, wherein the using further comprises outputting the candidate variation from the named entity list in a response to a user.
  - 5. The method of claim 4, wherein the using further comprises replacing the canonical phrase received in the query with at least one candidate variation.
  - 6. The method of claim 1, wherein determining the candidate variation for the canonical phrase further comprises one or more of:
    - evaluation of click log data to determine a distribution of clicks between different websites and scraping search logs of a search engine.
  - 7. The method of claim 1, wherein the score is a value determined from cross entropy processing of the distribution of click data.
  - 8. The method of claim 1, wherein the score is generated by processing that compares the distribution of click data for the candidate variation with a distribution of click data for the set of related websites.
  - 9. The method of claim 1, wherein determining the candidate variation to include in the language understanding model comprises determining whether a score of the candidate variation is above a threshold, and the training of the language understanding model further comprises updating the named entity list to include the candidate variation in response to determining the score for the candidate variation is above the threshold.
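
The scoring recited in claims 1 and 9 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the dictionary layout of the click data, the `smoothing` term, and the choice to count each website's vote as its share of the candidate's clicks are all assumptions made for illustration. Websites outside the set of related websites are assumed to carry a weight of zero.

```python
from typing import Dict, List


def likelihood_ratios(seed_clicks: Dict[str, int],
                      random_clicks: Dict[str, int],
                      smoothing: float = 1e-6) -> Dict[str, float]:
    """P(click on site | seed-list query) / P(click on site | random query).

    `smoothing` (an assumption) avoids division by zero for sites that
    never appear in the random-query sample.
    """
    seed_total = sum(seed_clicks.values())
    random_total = sum(random_clicks.values())
    if seed_total == 0 or random_total == 0:
        raise ValueError("click logs must contain at least one click")
    ratios = {}
    for site, clicks in seed_clicks.items():
        p_seed = clicks / seed_total
        p_random = random_clicks.get(site, 0) / random_total
        ratios[site] = p_seed / (p_random + smoothing)
    return ratios


def weighted_click_vote(candidate_clicks: Dict[str, int],
                        ratios: Dict[str, float]) -> float:
    """Sum over the sites clicked for a candidate, each weighted by its ratio.

    Each site's vote is its share of the candidate's clicks (an assumption).
    """
    total = sum(candidate_clicks.values())
    if total == 0:
        return 0.0
    return sum((clicks / total) * ratios.get(site, 0.0)
               for site, clicks in candidate_clicks.items())


def select_variations(candidates: Dict[str, Dict[str, int]],
                      ratios: Dict[str, float],
                      threshold: float) -> List[str]:
    """Keep the candidates whose score is above the threshold (claim 9)."""
    return [phrase for phrase, clicks in candidates.items()
            if weighted_click_vote(clicks, ratios) > threshold]


if __name__ == "__main__":
    # Hypothetical click counts for a movie-type seed entity list.
    seed = {"imdb.com": 80, "netflix.com": 20}
    background = {"imdb.com": 10, "netflix.com": 10, "weather.com": 80}
    ratios = likelihood_ratios(seed, background)
    candidates = {
        "lotr": {"imdb.com": 3, "weather.com": 1},
        "rings forecast": {"weather.com": 4},
    }
    print(select_variations(candidates, ratios, threshold=1.0))
```

In this sketch a candidate whose clicks land mostly on websites typical of the seed entity type scores high, while a candidate whose clicks land on unrelated websites scores near zero and is left out of the named entity list.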

10. A computer-readable storage device storing computer-executable instructions that, when executed by a processor, causes the processor to execute a method comprising:
- accessing a named entity list comprising a canonical phrase for an entry in the named entity list;
  
  determining one or more candidate variations for the canonical phrase, wherein a candidate variation is an alternative phrase for the canonical phrase;
  
  determining, using click log data associated with a seed entity list of entities that are of a same type as the canonical phrase, a set of related websites from a plurality of websites, wherein the set of related websites is determined by mining the click log data to identify links that are most selected for queries corresponding with the seed entity list, wherein the mining of the click data includes determining a ratio of clicks received for a particular website to clicks received for queries in the seed entity list;
  
  obtaining a likelihood ratio for each website in the set of related websites, wherein the likelihood ratio indicates a probability of a click on the particular website in response to a query from the seed entity list versus a probability of click on the particular website in response to a random query;
  
  generating a score for the one or more candidate variations based on evaluation of a distribution of click data for the one or more candidate variations, wherein the score for each of the one or more candidate variations is a weighted click vote over the web sites clicked for the one or more candidate variations and is generated as a sum over all websites clicked in response to a query comprising the one or more candidate variations, where each website is weighted by the likelihood ratio pertaining to the particular website;
  
  updating the named entity list to include the one or more of the candidate variations for the canonical phrase within the named entity list based on the score; and
  
  training a language understanding model by updating the named entity list to include the candidate variation for the canonical phrase based on the score.
  - 11. The computer-readable storage device of claim 10, wherein determining the one or more candidate variations comprises performing a search with at least one of the canonical phrases and obtaining results used to determine the candidate variations.
  - 12. The computer-readable storage device of claim 10, wherein determining the one or more candidate variations comprises accessing search data to determine a scoring for the one or more candidate variations.
  - 13. The computer-readable storage device of claim 10, wherein the using of the one or more candidate variations in an interaction with the user further comprises outputting the one or more candidate variations from the named entity list.
  - 14. The computer-readable storage device of claim 10, wherein determining the one or more candidate variations further comprises one or more of:
    - evaluation of click log data to determine a distribution of clicks between different websites and scraping search logs of a search engine.
  - 15. The computer-readable storage device of claim 10, wherein the method further comprises using the one or more candidate variations in an interaction with a user, and wherein the using of the one or more candidate variations in an interaction with a user further comprises replacing the canonical phrase with at least one of the candidate variations.
  - 16. The computer-readable storage device of claim 10, wherein the score is a value determined from cross entropy processing of the distribution of click data.
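
Claims 7 and 16 recite a score determined from cross entropy processing of the distribution of click data, without fixing a formula. One plausible reading, sketched below under stated assumptions, compares the candidate's click distribution with a reference distribution built from the seed entity list. The function names, the `floor` probability for websites missing from the reference, and the convention that a lower value means a closer match are assumptions, not details taken from the claims.

```python
import math
from typing import Dict


def normalize(clicks: Dict[str, int]) -> Dict[str, float]:
    """Turn raw click counts into a probability distribution over websites."""
    total = sum(clicks.values())
    if total == 0:
        raise ValueError("no clicks to normalize")
    return {site: count / total for site, count in clicks.items()}


def cross_entropy(candidate_clicks: Dict[str, int],
                  reference_clicks: Dict[str, int],
                  floor: float = 1e-6) -> float:
    """H(p, q) = -sum_site p(site) * log q(site).

    p is the candidate's click distribution and q is the reference
    distribution. `floor` (an assumption) stands in for q(site) when the
    site never appears in the reference data.
    """
    p = normalize(candidate_clicks)
    q = normalize(reference_clicks)
    return -sum(p_site * math.log(q.get(site, floor))
                for site, p_site in p.items())


if __name__ == "__main__":
    # Hypothetical click counts.
    reference = {"imdb.com": 50, "netflix.com": 50}
    similar = {"imdb.com": 1, "netflix.com": 1}
    unrelated = {"weather.com": 1}
    print(cross_entropy(similar, reference) < cross_entropy(unrelated, reference))
```

Under this reading, a candidate whose clicks fall on the same websites as the seed entities yields a low cross entropy, and a candidate whose clicks fall elsewhere yields a high one.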

17. A system for determining variations for named entities, comprising:
- a processor and memory;
  
  an operating environment executing using the processor; and
  
  a display, wherein the processor is configured to perform actions comprising:
  
  accessing a named entity list comprising a canonical phrase for at least one entry in the named entity list;
  
  determining one or more candidate variations for the canonical phrase, wherein a candidate variation is an alternative phrase for the canonical phrase;
  
  determining, using click log data associated with a seed entity list of entities that are of a same type as the canonical phrase, a set of related websites from a plurality of websites, wherein the set of related websites is determined by mining the click log data to identify links that are most selected for queries corresponding with the seed entity list, wherein the mining of the click data includes determining a ratio of clicks received for a particular website to clicks received for queries in the seed entity list;
  
  obtaining a likelihood ratio for a website in the set of related websites, wherein the likelihood ratio indicates a probability of a click on the particular website in response to a query from the seed entity list versus a probability of click on the particular website in response to a random query;
  
  generating a score for the one or more candidate variations based on evaluation of a distribution of click data for the candidate variation, wherein the score for the one or more candidate variation is a weighted click vote over the web sites clicked for the one or more candidate variations and is generated as a sum over all websites clicked in response to a query comprising the one or more candidate variations, where each website is weighted by the likelihood ratio pertaining to the particular web site;
  
  determining at least one candidate variation to include in a language understanding model using the score for the one or more candidate variations;
  
  training the language understanding model to include one or more of the candidate variations within the named entity list based on the determined candidate variations; and
  
  providing the trained language model as a resource in a distributed service.
  - 18. The system of claim 17, wherein the providing further comprises outputting at least one of the candidate variations from the named entity list in response to a received query.
  - 19. The system of claim 17, wherein the providing further comprises replacing the canonical phrase with at least one of the candidate variations in an interaction with a user.
  - 20. The system of claim 17, wherein the score is generated by processing that compares the distribution of click data for the candidate variation with a distribution of click data for the set of related websites.
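
Several dependent claims (3 to 5, 13, 15, 18 and 19) describe using the updated named entity list to process a query and to substitute a variation for the canonical phrase in a response. The sketch below shows one way this could look. The data layout, the function names, the longest-match lookup, and the rule of displaying the shortest known name are assumptions for illustration; the sample entity and its variations are hypothetical.

```python
from typing import Dict, List, Optional


def build_lookup(entity_list: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every known phrase (canonical or variation) to its canonical phrase."""
    lookup = {}
    for canonical, variations in entity_list.items():
        lookup[canonical.lower()] = canonical
        for variation in variations:
            lookup[variation.lower()] = canonical
    return lookup


def resolve(query: str, lookup: Dict[str, str]) -> Optional[str]:
    """Return the canonical phrase for the longest known phrase in the query.

    Plain substring matching is a simplification; it ignores word boundaries.
    """
    text = query.lower()
    matches = [phrase for phrase in lookup if phrase in text]
    if not matches:
        return None
    return lookup[max(matches, key=len)]


def display_name(canonical: str, entity_list: Dict[str, List[str]]) -> str:
    """Pick the shortest known name for output (an assumed display policy)."""
    return min([canonical] + entity_list.get(canonical, []), key=len)


if __name__ == "__main__":
    # Hypothetical updated named entity list.
    entities = {
        "The Lord of the Rings: The Fellowship of the Ring":
            ["fellowship of the ring", "lotr 1"],
    }
    lookup = build_lookup(entities)
    canonical = resolve("play lotr 1 tonight", lookup)
    if canonical is not None:
        print("Now playing", display_name(canonical, entities))
```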

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors: Hillard, Dustin; Celikyilmaz, Fethiye Asli; Hakkani-Tur, Dilek; Iyer, Rukmini; Tur, Gokhan
Primary Examiner(s): Sirjani, Fariba
Application Number: US13/725,614
Publication Number: US 20140180676A1
Time in Patent Office: 1,908 Days
Field of Search: 704/9
US Class Current:
CPC Class Codes: G06F 40/295 (Named entity recognition)
