Method and apparatus for automatic entity disambiguation
Abstract
Entity disambiguation resolves which names, words, or phrases in text correspond to distinct persons, organizations, locations, or other entities in the context of an entire corpus. The invention is based largely on language-independent algorithms. Thus, it is applicable not only to unstructured text from arbitrary human languages, but also to semi-structured data, such as citation databases and the disambiguation of named entities mentioned in wire transfer transaction records for the purpose of detecting money-laundering activity. The system uses multiple types of context as evidence for determining whether two mentions correspond to the same entity and it automatically learns the weight of evidence of each context item via corpus statistics. The invention uses multiple search keys to efficiently find pairs of mentions that correspond to the same entity, while skipping billions of unnecessary comparisons, yielding a system with very high throughput that can be applied to truly massive data.
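The abstract does not say what the search keys are. As a rough illustration of the general idea (often called blocking), the sketch below assumes two hypothetical key types, the last token of a name and each context item, and compares only mentions that share at least one key. The function names `search_keys` and `candidate_pairs` and the mention layout are assumptions made for this example, not taken from the patent.

```python
from collections import defaultdict
from itertools import combinations


def search_keys(mention):
    """Return hypothetical search keys for one mention (an assumption)."""
    keys = {("name", mention["name"].lower().split()[-1])}
    keys.update(("context", item) for item in mention["context"])
    return keys


def candidate_pairs(mentions):
    """Return index pairs of mentions that share at least one search key."""
    buckets = defaultdict(set)  # key -> indices of mentions carrying that key
    for idx, mention in enumerate(mentions):
        for key in search_keys(mention):
            buckets[key].add(idx)
    pairs = set()
    for ids in buckets.values():
        # Only mentions in the same bucket are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


mentions = [
    {"name": "John Smith", "context": ["acme widgets"]},
    {"name": "J. Smith", "context": ["boston"]},
    {"name": "Mary Jones", "context": ["acme widgets"]},
    {"name": "Wei Chen", "context": ["paris"]},
]
pairs = candidate_pairs(mentions)
```

With four mentions there are six possible pairs; here only two share a key, so the other four comparisons are never attempted.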
35 Claims
1. A method for entity disambiguation for execution by one or more data processors, the method comprising:
- carrying out a within-document co-reference resolution for a plurality of documents, the within-document co-reference resolution comprising:
  - initializing, by at least one of the data processors, a list of entities as an empty list;
  - processing, by at least one of the data processors, names in order of longest to shortest within each entity type class; and
  - comparing, by at least one of the data processors, each name to all entities with which it may be compatible, based on a token hash or database, the comparing being carried out via entity type-specific distance measures;
- aggregating, by at least one of the data processors, attributes about each entity mentioned in each document; and
- using, by at least one of the data processors, said entity attributes as features in determining which documents concern a same entity.
Dependent claims: 2-20.
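The within-document steps of claim 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: "longest" is taken to mean most tokens (ties broken by character length), the token hash is a plain in-memory dictionary, and `token_subset_distance` is a hypothetical stand-in for the entity type-specific distance measures, which the claim does not define.

```python
from collections import defaultdict


def token_subset_distance(name_tokens, entity_tokens):
    """Hypothetical distance: 0.0 if the name's tokens all occur in the entity."""
    return 0.0 if set(name_tokens) <= set(entity_tokens) else 1.0


# One distance function per entity type; every type falls back to the same
# placeholder here (an assumption made to keep the sketch short).
DISTANCE_BY_TYPE = defaultdict(lambda: token_subset_distance)


def resolve_within_document(mentions, threshold=0.5):
    """mentions: list of (entity_type, name) pairs found in one document."""
    entities = []                   # the entity list starts empty
    token_index = defaultdict(set)  # (type, token) -> indices into entities
    # Longest names first within each entity type class.
    ordered = sorted(
        mentions, key=lambda m: (m[0], -len(m[1].split()), -len(m[1]))
    )
    for etype, name in ordered:
        tokens = name.lower().split()
        # Only entities sharing a token of the same type are compatible.
        candidates = set()
        for tok in tokens:
            candidates |= token_index[(etype, tok)]
        best, best_dist = None, threshold
        for idx in sorted(candidates):
            dist = DISTANCE_BY_TYPE[etype](tokens, entities[idx]["tokens"])
            if dist < best_dist:
                best, best_dist = idx, dist
        if best is None:
            # No compatible entity is close enough: create a new one.
            entities.append({"type": etype, "tokens": tokens, "mentions": [name]})
            for tok in tokens:
                token_index[(etype, tok)].add(len(entities) - 1)
        else:
            entities[best]["mentions"].append(name)
    return entities


entities = resolve_within_document([
    ("PERSON", "Smith"),
    ("PERSON", "John Smith"),
    ("ORG", "Acme Corp"),
    ("PERSON", "Mary Jones"),
    ("ORG", "Acme"),
])
```

Processing long names first means that a short mention such as "Smith" is matched against the fuller "John Smith" entity that already exists, instead of seeding an under-specified entity of its own.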
21. A method for entity disambiguation for implementation by one or more data processors, the method comprising:
- using, by at least one of the data processors, cross-document disambiguation to identify an entity across a plurality of context domains, each domain comprising a finite set of context items, the context items comprising standardized names derived in an entity information extraction phase of within-document disambiguation;
- using, by at least one of the data processors, a logarithm of an inverse name frequency which comprises a number of standard names with which a context item appears as a weight indicating salience of each context item, wherein co-occurrence with a common name provides less indication that two mentions correspond to a same entity than co-occurrence with an uncommon name;
- using, by at least one of the data processors, a sparse count vector for recording all of the items that co-occur with a particular entity;
- creating, by at least one of the data processors, a sparse count vector of title tokens that occur with an entity; and
- computing, by at least one of the data processors, inverse name frequency weights for said title tokens.
Dependent claims: 22-26.
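The weighting in claim 21 can be illustrated with a short sketch. The claim does not give an exact formula, so this example assumes, by analogy with inverse document frequency, that the weight of a context item is log(N / n), where N is the total number of standard names and n is the number of standard names the item appears with. The function names and the toy data are hypothetical; `collections.Counter` stands in for the sparse count vector.

```python
import math
from collections import Counter, defaultdict


def inverse_name_frequency(cooccurrences):
    """cooccurrences: iterable of (standard_name, context_item) pairs.

    Returns item -> log(N / n), an assumed form of the weight.
    """
    names_per_item = defaultdict(set)
    all_names = set()
    for name, item in cooccurrences:
        names_per_item[item].add(name)
        all_names.add(name)
    total = len(all_names)
    return {
        item: math.log(total / len(names))
        for item, names in names_per_item.items()
    }


def sparse_count_vector(items):
    """Record only the items that actually co-occur with an entity."""
    return Counter(items)


pairs = [
    ("john smith", "ibm"),
    ("mary jones", "ibm"),
    ("wei chen", "ibm"),
    ("john smith", "acme widgets"),
]
weights = inverse_name_frequency(pairs)

# Title tokens observed with one entity, stored sparsely.
title_vector = sparse_count_vector(["chief", "executive", "officer", "chief"])
```

In this toy data "ibm" appears with all three names and so gets weight 0, while "acme widgets" appears with one name and gets weight log 3: sharing the rarer item is stronger evidence that two mentions refer to the same entity.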
27. An article comprising a tangible machine-readable storage medium embodying instructions that when performed by one or more machines result in operations comprising:
- carrying out a within-document co-reference resolution for a plurality of documents;
- aggregating attributes about each entity mentioned in each document; and
- using the entity attributes as features in determining which documents concern a same entity;
the within-document co-reference resolution comprising:
- initializing a list of entities as an empty list;
- processing names in order of longest to shortest within each entity type class; and
- comparing each name to all entities with which it may be compatible, based on a token hash or database, the comparing being carried out via entity type-specific distance measures.
28. An article comprising a tangible machine-readable storage medium embodying instructions that when performed by one or more machines result in operations comprising:
- identifying a first plurality of entities by carrying out a within-document co-reference resolution for a first document;
- identifying a second plurality of entities by carrying out a within-document co-reference resolution for a second document;
- identifying at least one common entity that exists in the first plurality of entities and the second plurality of entities; and
- verifying the at least one common entity by comparing the first plurality of entities to the second plurality of entities;
the within-document co-reference resolution comprising:
- initializing a list of entities as an empty list;
- processing names in order of longest to shortest within each entity type class; and
- comparing each name to all entities with which it may be compatible, based on a token hash or database, the comparing being carried out via entity type-specific distance measures.
29. An article comprising a tangible machine-readable storage medium embodying instructions that when performed by one or more machines result in operations comprising:
- identifying a first plurality of entities by carrying out a within-document co-reference resolution for a first document;
- identifying a second plurality of entities by carrying out a within-document co-reference resolution for a second document;
- identifying at least one common entity that exists in the first plurality of entities and the second plurality of entities; and
- verifying the at least one common entity based on a first context of the at least one common entity in the first plurality of entities and the second context of the at least one common entity in the second plurality of entities;
the within-document co-reference resolution comprising:
- initializing a list of entities as an empty list;
- processing names in order of longest to shortest within each entity type class; and
- comparing each name to all entities with which it may be compatible, based on a token hash or database, the comparing being carried out via entity type-specific distance measures.
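One way to picture the context-based verification of claim 29 is sketched below. It assumes each document has already been reduced to a mapping from standardized entity name to the set of context items seen with that entity, and that a table of per-item weights is available. The function names, the threshold, and the weight values are all hypothetical; the claim does not specify how the two contexts are compared.

```python
def find_common(doc_a, doc_b):
    """Names that appear as entities in both documents."""
    return sorted(doc_a.keys() & doc_b.keys())


def verify_by_context(name, doc_a, doc_b, weights, threshold=1.0):
    """Accept the match if the shared context items carry enough weight."""
    shared = doc_a[name] & doc_b[name]
    score = sum(weights.get(item, 0.0) for item in shared)
    return score >= threshold


# Hypothetical salience weights for context items (assumed values).
weights = {"acme widgets": 2.0, "boston": 1.5, "ibm": 0.1}

doc_a = {"john smith": {"acme widgets", "boston"}, "mary jones": {"ibm"}}
doc_b = {"john smith": {"acme widgets", "ibm"}, "wei chen": {"paris"}}

common = find_common(doc_a, doc_b)
verified = [n for n in common if verify_by_context(n, doc_a, doc_b, weights)]
```

Here "john smith" occurs in both documents and both mentions share the salient item "acme widgets", so the match is accepted. Had the only shared item been the low-weight "ibm", the score would fall below the threshold and the two mentions would be kept apart.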
30. An article comprising a tangible machine-readable storage medium embodying instructions that when performed by one or more machines result in operations comprising:
- using cross-document disambiguation to identify an entity across a plurality of context domains, each domain comprising a finite set of context items, the context items comprising standardized names derived in an entity information extraction phase of within-document disambiguation;
- using a logarithm of an inverse name frequency which comprises a number of standard names with which a context item appears as a weight indicating salience of each context item, wherein co-occurrence with a common name provides less indication that two mentions correspond to a same entity than co-occurrence with an uncommon name;
- using a sparse count vector for recording all of the items that co-occur with a particular entity;
- creating a sparse count vector of title tokens that occur with an entity; and
- computing inverse name frequency weights for said title tokens.
Dependent claims: 31-35.
Specification