Automatic disambiguation based on a reference resource
First Claim
1. A computer implemented method, performed by a computer having a processor, of disambiguating references to named entities, comprising:
identifying a surface form of a named entity in a text, the surface form being an ambiguous orthographic representation of a common name for the named entity, the surface form having a corresponding surface form reference in a surface form reference database;
enumerating, from the surface form reference, a plurality of different reference named entities based on the identified surface form of the named entity, wherein the surface form is associated in the surface form reference with the plurality of different reference named entities each being formed of a different set of words, and each of the different reference named entities is associated with a named entity reference, the named entity references being stored in a named entity reference database that is separate from the surface form reference database, each of the named entity references associating one of the different reference named entities to multiple entity indicators, the entity indicators including both labels applied to a respective named entity in an information resource, and context indicators applied to the respective named entity in the information resource, in which the labels comprise classifying identifiers applied to the respective named entities in the information resource;
evaluating, with the processor, one or more measures of correlation between one or more of the entity indicators in the information resource for each of the identified reference named entities, and the text, the evaluation including comparisons of the text to both the labels and the context indicators;
identifying, with the processor, one of the reference named entities for which the associated entity indicators have a relatively high correlation to the text; and
providing a disambiguation output that indicates the identified reference named entity to be associated with the surface form of the named entity in the text.
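The steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The two in-memory dictionaries standing in for the separate surface form reference database and named entity reference database, the sample entries, and the word-overlap score used as the "measure of correlation" are all assumptions made for illustration.

```python
import re

# Hypothetical surface form reference store: ambiguous surface form -> candidate entities.
SURFACE_FORM_REFERENCES = {
    "Columbia": ["Columbia University", "Space Shuttle Columbia", "Columbia Pictures"],
}

# Hypothetical named entity reference store, kept separate from the store above.
# Each entity maps to its entity indicators: labels and context indicators.
NAMED_ENTITY_REFERENCES = {
    "Columbia University": {
        "labels": {"universities"},
        "contexts": {"campus", "students", "faculty"},
    },
    "Space Shuttle Columbia": {
        "labels": {"spacecraft", "nasa"},
        "contexts": {"orbit", "launch", "astronauts"},
    },
    "Columbia Pictures": {
        "labels": {"studios"},
        "contexts": {"film", "movie", "hollywood"},
    },
}


def correlation(text, reference):
    """Assumed measure of correlation: count of indicators that occur as words in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    # The text is compared to both the labels and the context indicators.
    return len(tokens & reference["labels"]) + len(tokens & reference["contexts"])


def disambiguate(surface_form, text):
    """Return the candidate entity whose indicators correlate best with the text."""
    candidates = SURFACE_FORM_REFERENCES.get(surface_form, [])
    if not candidates:
        return None
    scores = {name: correlation(text, NAMED_ENTITY_REFERENCES[name]) for name in candidates}
    return max(scores, key=scores.get)
```

Under these assumptions, calling `disambiguate("Columbia", "NASA said the astronauts aboard Columbia reached orbit after launch.")` selects the space shuttle entry, because only its labels and context indicators overlap with the surrounding text.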
Abstract
A novel system for automatically indicating the specific identity of ambiguous named entities is provided. An automatic disambiguation data collection is created using a reference resource. Explicit named entities are catalogued from the reference resource, together with various abbreviated, alternative, and casual ways of referring to the named entities. Entity indicators, such as labels and context indicators associated with the named entities in the reference resource, are also catalogued. The automatic disambiguation collection can then be used as a basis for evaluating ambiguous references to named entities in text content provided in different applications. The content surrounding the ambiguous reference may be compared with the entity indicators to find a good match, indicating that the named entity associated with the matching entity indicators is the intended identity of the ambiguous reference, which can be automatically provided to a user.
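The cataloguing step described in the abstract might look like the following sketch. The reference resource here is a hypothetical list of records, and the field names `title`, `aliases`, `labels`, and `contexts` are assumptions; the abstract does not specify how the resource is structured.

```python
from collections import defaultdict

# Hypothetical reference resource: one record per explicit named entity.
REFERENCE_RESOURCE = [
    {
        "title": "Columbia University",
        "aliases": ["Columbia", "Columbia U."],
        "labels": ["universities"],
        "contexts": ["campus", "students"],
    },
    {
        "title": "Space Shuttle Columbia",
        "aliases": ["Columbia", "OV-102"],
        "labels": ["spacecraft"],
        "contexts": ["orbit", "nasa"],
    },
]


def build_collection(resource):
    """Catalogue surface forms and entity indicators from the resource."""
    surface_forms = defaultdict(list)  # surface form -> named entities it may refer to
    entity_references = {}             # named entity -> its entity indicators
    for entry in resource:
        name = entry["title"]
        entity_references[name] = {
            "labels": set(entry["labels"]),
            "contexts": set(entry["contexts"]),
        }
        # The explicit name and every alternative way of referring to it are surface forms.
        for form in [name, *entry["aliases"]]:
            if name not in surface_forms[form]:
                surface_forms[form].append(name)
    return dict(surface_forms), entity_references


surface_forms, entity_references = build_collection(REFERENCE_RESOURCE)
```

In this toy data the surface form "Columbia" ends up associated with both entities, which is exactly the ambiguity the entity indicators are later used to resolve.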
62 Citations
17 Claims
17. A computer-readable storage medium comprising computer-executable instructions which, when executed by a computing device having a processor, enable the computing device to prepare and apply an automatic disambiguation system, comprising steps of:
extracting, with the processor, a collection of surface forms associated with a plurality of named entities, that are different from the surface forms, from an information resource;
extracting, with the processor, a collection of labels associated with the named entities from the information resource;
extracting, with the processor, a collection of context indicators associated with the named entities from the information resource;
when provided with a surface form in a text sample having different units of text, evaluating, with the processor, a measure of correlation of entity indicators associated with the surface form in the text sample with the labels and the context indicators associated with the named entities associated with the surface form in the collection of surface forms by first evaluating one or more measures of similarity between a larger one of the units of text and the entity indicators, and if more than one of the collection of extracted named entities from the information resource is identified as having associated entity indicators with a relatively high correlation to the entity indicators for the surface form in the sample of text, then evaluating the one or more measures of similarity between an iteratively smaller one of the units of text and the entity indicators, until a unique extracted named entity from the information resource is identified as having associated entity indicators with a relatively high correlation to the entity indicators for the surface form in the sample of text; and
providing a display, based on the measure of correlation, showing a representation of the text sample, the display including an indication of one of the named entities to be a disambiguation of the surface form in the text sample, the indication of the one of the named entities being positioned proximate to the surface form in the representation of the text sample.
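The iterative narrowing recited in claim 17 (larger units of text first, then smaller ones until a single entity stands out) could be sketched as follows. Several choices here are assumptions, not taken from the claim: "relatively high correlation" is read as "tied for the top score", tied candidates are the only ones carried to the next unit, and similarity is plain word overlap. The function and variable names are hypothetical.

```python
import re


def similarity(unit_text, indicators):
    """Assumed similarity measure: number of indicator words found in the unit of text."""
    tokens = set(re.findall(r"[a-z]+", unit_text.lower()))
    return len(tokens & indicators)


def narrow(units, candidates):
    """Resolve a surface form by scoring progressively smaller units of text.

    units: text units ordered from largest to smallest, each containing the surface form.
    candidates: entity name -> set of indicator words (labels and context indicators).
    Returns the unique best entity, or None if the ambiguity is never resolved.
    """
    remaining = dict(candidates)
    if not remaining:
        return None
    for unit in units:
        scores = {name: similarity(unit, ind) for name, ind in remaining.items()}
        best = max(scores.values())
        # Keep only the candidates tied for the highest score (assumed reading).
        remaining = {name: remaining[name] for name, s in scores.items() if s == best}
        if len(remaining) == 1:
            return next(iter(remaining))
    return None


candidates = {
    "Columbia University": {"campus", "students", "faculty"},
    "Space Shuttle Columbia": {"orbit", "launch", "nasa"},
}
document = ("Students on campus watched the launch. "
            "NASA confirmed Columbia reached orbit. Faculty cheered.")
sentence = "NASA confirmed Columbia reached orbit."
result = narrow([document, sentence], candidates)
```

In this example the whole document matches three indicator words for each candidate, so the tie is only broken at the sentence level, where the shuttle indicators alone appear.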
Specification