Method and apparatus for automatic entity disambiguation

US 7,672,833 B2
Filed: 09/22/2005
Issued: 03/02/2010
Est. Priority Date: 09/22/2005
Status: Active Grant

Abstract

Entity disambiguation resolves which names, words, or phrases in text correspond to distinct persons, organizations, locations, or other entities in the context of an entire corpus. The invention is based largely on language-independent algorithms. Thus, it is applicable not only to unstructured text from arbitrary human languages, but also to semi-structured data, such as citation databases and the disambiguation of named entities mentioned in wire transfer transaction records for the purpose of detecting money-laundering activity. The system uses multiple types of context as evidence for determining whether two mentions correspond to the same entity and it automatically learns the weight of evidence of each context item via corpus statistics. The invention uses multiple search keys to efficiently find pairs of mentions that correspond to the same entity, while skipping billions of unnecessary comparisons, yielding a system with very high throughput that can be applied to truly massive data.
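
As an illustration of the search-key idea described in the abstract, the following Python sketch groups mentions under several keys and pairs up only mentions that share at least one key. The key scheme (the full lowercased name, plus the first initial joined to the last token) and the names `search_keys` and `candidate_pairs` are assumptions made for this example; the abstract does not say which keys the system uses.

```python
from collections import defaultdict
from itertools import combinations

def search_keys(name):
    """Hypothetical key scheme: full lowercased name, plus first initial and last token."""
    tokens = name.lower().split()
    keys = {" ".join(tokens)}
    if len(tokens) >= 2:
        keys.add(tokens[0][0] + "|" + tokens[-1])
    return keys

def candidate_pairs(mentions):
    """Return index pairs (i, j), i < j, of mentions that share at least one search key."""
    buckets = defaultdict(list)
    for idx, name in enumerate(mentions):
        for key in search_keys(name):
            buckets[key].append(idx)
    pairs = set()
    for ids in buckets.values():
        # Only mentions that landed in the same bucket are ever compared.
        pairs.update(combinations(ids, 2))
    return pairs

mentions = ["John Smith", "J. Smith", "Jane Doe", "John Smith", "Acme Corp"]
pairs = candidate_pairs(mentions)
```

For these five mentions only three of the ten possible pairs remain as candidates, all among the Smith variants. A real system would still score each candidate pair with a distance measure before merging.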

35 Claims

1. A method for entity disambiguation for execution by one or more data processors, the method comprising:
- carrying out a within-document co-reference resolution for a plurality of documents, the within-document co-reference resolution comprising:
  
  initializing, by at least one of the data processors, a list of entities as an empty list;
  
  processing, by at least one of the data processors, names in order of longest to shortest within each entity type class; and
  
  comparing, by at least one of the data processors, each name to all entities with which it may be compatible, based on a token hash or database, the comparing being carried out via entity type-specific distance measures;
  
  aggregating, by at least one of the data processors, attributes about each entity mentioned in each document; and
  
  using, by at least one of the data processors, said entity attributes as features in determining which documents concern a same entity.
- - 2. The method of claim 1, further comprising:
    - using one or more named entity recognition (NER) systems to provide input for entity disambiguation;
      
      each NER providing mention start and stop boundaries, entity type assertions, and confidence values.
  - 3. The method of claim 2, further comprising:
    - said NER systems using document-level or corpus-level information.
  - 4. The method of claim 2, further comprising:
    - using, by at least one of the data processors, a majority vote to re-categorize entities.
  - 5. The method of claim 2, further comprising:
    - changing, by at least one of the data processors, entity types based on evidence from the rest of a document upon identification of segmentation errors;
      
      recognizing, by at least one of the data processors, additional mentions if identical strings were recognized as entities, elsewhere in said document.
  - 6. The method of claim 1, further comprising:
    - resolving, by at least one of the data processors, entity type discrepancies via majority vote over all identical strings in a document that have been labeled, using name lists which comprise any of person titles, given names, family names, and location tokens, and confidence values to resolve ties.
  - 7. The method of claim 1, further comprising:
    - resolving, by at least one of the data processors, segmentation discrepancies with confidence values by using string starting and ending positions to resolve ties.
  - 8. The method of claim 7, further comprising:
    - detecting and repairing, by at least one of the data processors, at least some NER segmentation errors using document-level token sequence counts and word lists.
  - 9. The method of claim 1, further comprising:
    - identifying, by at least one of the data processors, additional mentions that were not recognized by the within-document co-reference resolution that are identical to recognized mentions; and
      
      labeling, by at least one of the data processors, said mentions as entities.
  - 10. The method of claim 1, further comprising:
    - carrying out, by at least one of the data processors, entity type-specific parsing to extract entity attributes, generate standardized names, and populate data structures that are used to perform said method of within-document entity disambiguation.
  - 11. The method of claim 10, said data structures comprising in-memory token hashes or database records.
  - 12. The method of claim 10, further comprising:
    - in the case of person mentions, said entity information extraction process detecting, by at least one of the data processors, any of prefixes and suffixes;
      
      removing, by at least one of the data processors, said detected prefixes and suffixes from a name itself; and
      
      storing, by at least one of the data processors, said detected prefixes and suffixes as attributes.
  - 13. The method of claim 12, further comprising:
    - using, by at least one of the data processors, a list of given names that includes gender probabilities and confidence levels in conjunction with titles to infer a likely gender of an entity.
  - 14. The method of claim 12, further comprising:
    - using, by at least one of the data processors, a list of titles as evidence for entity attributes including job category.
  - 15. The method of claim 10, further comprising:
    - using, by at least one of the data processors, a set of expressions to operate on token type sequences, as opposed to on names themselves, to resolve ambiguities.
  - 16. The method of claim 10, further comprising:
    - in the case of organizations, computing and storing, by at least one of the data processors, likely acronyms for subsequent use by said disambiguation method.
  - 17. The method of claim 1, wherein a person distance measure enforces gender consistency, deals with given name variants using a given name variant list, and allows initials to match long forms of names;
    - wherein if a match is found, the name is assigned, by at least one of the data processors, to an existing entity;
      
      wherein if a match is not found, the name is used, by at least one of the data processors, to seed a new entity; and
      
      wherein if a name matches multiple entities, the name is assigned, by at least one of the data processors, to a name having a most recent mention.
  - 18. The method of claim 1, further comprising:
    - using, by at least one of the data processors, language-specific components as pattern generators.
  - 19. The method of claim 1, further comprising:
    - creating, by at least one of the data processors, one observed string from another, wherein the two resulting mentions are variants of a same name.
  - 20. The method of claim 1, further comprising:
    - automatically learning, by at least one of the data processors, desired transformation properties from a corpus of text;
      
      wherein the need for language-specific resources and rules is obviated.
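
The within-document steps of claims 1 and 17 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the token-count ordering, the toy `compatible` distance measure (exact token match, or an initial matching a longer token), the reading of "most recent mention" as the closest earlier position, and all function names are assumptions made for the example.

```python
from collections import defaultdict

def normalize(name):
    """Lowercase tokens with periods stripped, e.g. 'J. Smith' -> ['j', 'smith']."""
    return [t.strip(".").lower() for t in name.split() if t.strip(".")]

def compatible(tokens, entity_tokens):
    """Toy distance measure: each token equals an entity token or is an initial of one."""
    return all(
        any(t == e or (len(t) == 1 and e.startswith(t)) for e in entity_tokens)
        for t in tokens
    )

def recency(entity, pos):
    """Position of the entity's latest mention at or before pos, or -1 if none."""
    earlier = [p for p in entity["positions"] if p <= pos]
    return max(earlier) if earlier else -1

def resolve(mentions):
    """mentions: iterable of (position, entity_type, name) tuples for one document."""
    entities = []                # the entity list starts empty
    index = defaultdict(set)     # token hash: (type, token or initial) -> entity ids
    # Longest names first within each entity type class (by token count here).
    ordered = sorted(mentions, key=lambda m: (m[1], -len(normalize(m[2])), m[0]))
    for pos, etype, name in ordered:
        tokens = normalize(name)
        candidates = set()
        for t in tokens:
            candidates |= index[(etype, t)]
        matches = [i for i in sorted(candidates)
                   if compatible(tokens, entities[i]["tokens"])]
        if matches:
            # Several matches: prefer the entity with the most recent mention.
            best = max(matches, key=lambda i: recency(entities[i], pos))
            entities[best]["mentions"].append(name)
            entities[best]["positions"].append(pos)
        else:
            # No match: the name seeds a new entity and is added to the token hash.
            entities.append({"type": etype, "tokens": tokens,
                             "mentions": [name], "positions": [pos]})
            for t in tokens:
                index[(etype, t)].add(len(entities) - 1)
                index[(etype, t[0])].add(len(entities) - 1)
    return entities

doc = [
    (0, "PERSON", "John Smith"), (5, "PERSON", "Smith"),
    (9, "PERSON", "Mary Smith"), (12, "PERSON", "Smith"),
    (14, "PERSON", "J. Smith"), (20, "ORG", "Acme Corp"), (25, "ORG", "Acme"),
]
entities = resolve(doc)
```

Processing long names first means that short, ambiguous names such as "Smith" are only resolved once the full names that could anchor them are already in the entity list. In this example the "Smith" at position 5 joins John Smith and the "Smith" at position 12 joins Mary Smith, because those are the closest earlier compatible mentions.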

21. A method for entity disambiguation for implementation by one or more data processors, the method comprising:
- using, by at least one of the data processors, cross-document disambiguation to identify an entity across a plurality of context domains, each domain comprising a finite set of context items, the context items comprising standardized names derived in an entity information extraction phase of within-document disambiguation;
  
  using, by at least one of the data processors, a logarithm of an inverse name frequency which comprises a number of standard names with which a context item appears as a weight indicating salience of each context item, wherein co-occurrence with a common name provides less indication that two mentions correspond to a same entity than co-occurrence with an uncommon name;
  
  using, by at least one of the data processors, a sparse count vector for recording all of the items that co-occur with a particular entity;
  
  creating, by at least one of the data processors, a sparse count vector of title tokens that occur with an entity; and
  
  computing, by at least one of the data processors, inverse name frequency weights for said title tokens.
- - 22. The method of claim 21, further comprising:
    - creating, by at least one of the data processors, a sparse count vector of title tokens that occur with an entity; and
      
      computing, by at least one of the data processors, inverse name frequency weights for said title tokens.
  - 23. The method of claim 21, further comprising:
    - creating, by at least one of the data processors, a word vector space in an unsupervised fashion;
      
      wherein each document is represented by a vector in said vector space;
      
      deleting, by at least one of the data processors, all named entity mentions from each document prior to computing its vector to avoid double-counting context features;
      
      wherein an unsupervised clustering of some of the document vectors defines a segmentation of the vector space;
      
      uniquely assigning, by at least one of the data processors, each document to a single segment; and
      
      computing, by at least one of the data processors, inverse name frequency weights indicating the context's salience based on a number of standardized names that occur in documents falling into each segment.
  - 24. The method of claim 21, further comprising:
    - defining, by at least one of the data processors, a separate distance measure per each specific context domain to discount co-occurrence with multiple items, as well as quantify an unexpected lack of shared co-occurrence;
      
      wherein a score produced by each distance measure is loosely interpreted as a function of the likelihood of two randomly generated contexts sharing an observed degree of similarity.
  - 25. The method of claim 24, further comprising:
    - automatically learning, by at least one of the data processors, the distance measures from unlabeled data by using the fact that pairs of unexpectedly common full names typically correspond to the same entity, whereas pairs with some shared name tokens and some differing name tokens typically correspond to different entities.
  - 26. The method of claim 21, further comprising:
    - using, by at least one of the data processors, a lexical (string) distance measure to determine whether two name tokens sound the same;
      
      wherein a large negative score indicates a log likelihood.
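
The inverse name frequency weighting and sparse count vectors of claim 21 can be illustrated with a short Python sketch. The normalization by the total number of standard names (by analogy with inverse document frequency), the additive `shared_context_score`, and the sample data are assumptions for illustration; the claims state only that the weight is a logarithm of an inverse name frequency and that the distance measures are defined per context domain.

```python
import math
from collections import Counter, defaultdict

def inverse_name_frequency(cooccurrences):
    """cooccurrences: iterable of (standard_name, context_item) pairs.

    Returns log(total names / names the item appears with) for each context item.
    """
    names_per_item = defaultdict(set)
    all_names = set()
    for name, item in cooccurrences:
        names_per_item[item].add(name)
        all_names.add(name)
    total = len(all_names)
    return {item: math.log(total / len(names))
            for item, names in names_per_item.items()}

def shared_context_score(vec_a, vec_b, weights):
    """Sum of weights over context items present in both sparse count vectors."""
    return sum(weights.get(item, 0.0) for item in vec_a.keys() & vec_b.keys())

# Hypothetical corpus statistics: which context items occur with which standard names.
cooccurrences = [
    ("john smith", "united states"), ("mary jones", "united states"),
    ("wei chen", "united states"), ("ana silva", "united states"),
    ("john smith", "acme corp"), ("mary jones", "acme corp"),
    ("john smith", "zenith labs"),
]
weights = inverse_name_frequency(cooccurrences)

# Sparse count vectors of the items that co-occur with two mentions.
mention_a = Counter({"united states": 3, "zenith labs": 1})
mention_b = Counter({"united states": 1, "zenith labs": 2, "acme corp": 1})
score = shared_context_score(mention_a, mention_b, weights)
```

Here "united states" appears with every name and therefore carries zero weight, while "zenith labs" appears with a single name and contributes the whole score of log 4. The same computation applies when the context items are title tokens.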

27. An article comprising a tangible machine-readable storage medium embodying instructions that when performed by one or more machines result in operations comprising:
- carrying out a within-document co-reference resolution for a plurality of documents;
  
  aggregating attributes about each entity mentioned in each document; and
  
  using the entity attributes as features in determining which documents concern a same entity;
  
  the within-document co-reference resolution comprising:
  
  initializing a list of entities as an empty list;
  
  processing names in order of longest to shortest within each entity type class; and
  
  comparing each name to all entities with which it may be compatible, based on a token hash or database, the comparing being carried out via entity type-specific distance measures.

28. An article comprising a tangible machine-readable storage medium embodying instructions that when performed by one or more machines result in operations comprising:
- identifying a first plurality of entities by carrying out a within-document co-reference resolution for a first document;
  
  identifying a second plurality of entities by carrying out a within-document co-reference resolution for a second document;
  
  identifying at least one common entity that exists in the first plurality of entities and the second plurality of entities; and
  
  verifying the at least one common entity by comparing the first plurality of entities to the second plurality of entities;
  
  the within-document co-reference resolution comprising:
  
  initializing a list of entities as an empty list;
  
  processing names in order of longest to shortest within each entity type class; and
  
  comparing each name to all entities with which it may be compatible, based on a token hash or database, the comparing being carried out via entity type-specific distance measures.

29. An article comprising a tangible machine-readable storage medium embodying instructions that when performed by one or more machines result in operations comprising:
- identifying a first plurality of entities by carrying out a within-document co-reference resolution for a first document;
  
  identifying a second plurality of entities by carrying out a within-document co-reference resolution for a second document;
  
  identifying at least one common entity that exists in the first plurality of entities and the second plurality of entities; and
  
  verifying the at least one common entity based on a first context of the at least one common entity in the first plurality of entities and the second context of the at least one common entity in the second plurality of entities;
  
  the within-document co-reference resolution comprising:
  
  initializing a list of entities as an empty list;
  
  processing names in order of longest to shortest within each entity type class; and
  
  comparing each name to all entities with which it may be compatible, based on a token hash or database, the comparing being carried out via entity type-specific distance measures.

30. An article comprising a tangible machine-readable storage medium embodying instructions that when performed by one or more machines result in operations comprising:
- using cross-document disambiguation to identify an entity across a plurality of context domains, each domain comprising a finite set of context items, the context items comprising standardized names derived in an entity information extraction phase of within-document disambiguation;
  
  using a logarithm of an inverse name frequency which comprises a number of standard names with which a context item appears as a weight indicating salience of each context item, wherein co-occurrence with a common name provides less indication that two mentions correspond to a same entity than co-occurrence with an uncommon name;
  
  using a sparse count vector for recording all of the items that co-occur with a particular entity;
  
  creating a sparse count vector of title tokens that occur with an entity; and
  
  computing inverse name frequency weights for said title tokens.
- - 31. An article as in claim 30, wherein the tangible machine-readable storage medium further embodies instructions that when performed by one or more machines result in operations comprising:
    - creating a sparse count vector of title tokens that occur with an entity; and
      
      computing inverse name frequency weights for said title tokens.
  - 32. An article as in claim 30, wherein the tangible machine-readable storage medium further embodies instructions that when performed by one or more machines result in operations comprising:
    - creating a word vector space in an unsupervised fashion;
      
      wherein each document is represented by a vector in said vector space;
      
      deleting all named entity mentions from each document prior to computing its vector to avoid double-counting context features;
      
      wherein an unsupervised clustering of some of the document vectors defines a segmentation of the vector space;
      
      uniquely assigning each document to a single segment; and
      
      computing inverse name frequency weights indicating the context's salience based on a number of standardized names that occur in documents falling into each segment.
  - 33. An article as in claim 30, wherein the tangible machine-readable storage medium further embodies instructions that when performed by one or more machines result in operations comprising:
    - defining a separate distance measure per each specific context domain to discount co-occurrence with multiple items, as well as quantify an unexpected lack of shared co-occurrence;
      
      wherein a score produced by each distance measure is loosely interpreted as a function of the likelihood of two randomly generated contexts sharing an observed degree of similarity.
  - 34. An article as in claim 33, wherein the tangible machine-readable storage medium further embodies instructions that when performed by one or more machines result in operations comprising:
    - automatically learning the distance measures from unlabeled data by using the fact that pairs of unexpectedly common full names typically correspond to the same entity, whereas pairs with some shared name tokens and some differing name tokens typically correspond to different entities.
  - 35. An article as in claim 30, wherein the tangible machine-readable storage medium further embodies instructions that when performed by one or more machines result in operations comprising:
    - using a lexical (string) distance measure to determine whether two name tokens sound the same;
      
      wherein a large negative score indicates a log likelihood.
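
The document-segment context of claims 23 and 32 can be sketched in Python along the following lines. The sketch assumes the unsupervised clustering has already produced segment centroids, and it uses bag-of-words counts with cosine similarity as a stand-in for the word vector space; those choices, the weight formula, and all names and data below are hypothetical.

```python
import math
from collections import Counter, defaultdict

def doc_vector(tokens, mention_tokens):
    """Bag-of-words counts with named-entity mention tokens deleted first."""
    return Counter(t for t in tokens if t not in mention_tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def assign_segment(vec, centroids):
    """Index of the most similar centroid, so each document gets exactly one segment."""
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))

def segment_weights(doc_segments, doc_names):
    """log(total names / names seen in the segment) for each segment."""
    names_in_segment = defaultdict(set)
    for doc_id, seg in doc_segments.items():
        names_in_segment[seg] |= set(doc_names[doc_id])
    total = len(set().union(*names_in_segment.values()))
    return {seg: math.log(total / len(names))
            for seg, names in names_in_segment.items() if names}

# Assumed output of an earlier clustering step.
centroids = [Counter({"bank": 1, "loan": 1}), Counter({"match": 1, "goal": 1})]
docs = {
    "d1": ("john smith approved the loan at the bank".split(), {"john", "smith"}),
    "d2": ("mary jones scored a goal in the match".split(), {"mary", "jones"}),
    "d3": ("the bank hired wei chen and john smith".split(),
           {"wei", "chen", "john", "smith"}),
}
doc_names = {"d1": ["john smith"], "d2": ["mary jones"],
             "d3": ["wei chen", "john smith"]}

vectors = {d: doc_vector(toks, ments) for d, (toks, ments) in docs.items()}
doc_segments = {d: assign_segment(v, centroids) for d, v in vectors.items()}
weights = segment_weights(doc_segments, doc_names)
```

Deleting the mention tokens before building each vector keeps the names themselves from being counted again as context. In this toy data the sports segment contains one of the three names and so receives a higher weight than the banking segment, which contains two.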

Current Assignee
Fair Isaac Corporation
Original Assignee
Fair Isaac Corporation
Inventors
Freitag, Dayne; Calmbach, Richard; Rohwer, Richard; Blume, Matthias; Zoldi, Scott
Primary Examiner(s)
Hudspeth; David R
Assistant Examiner(s)
Jackson; Jakieda R

Application Number

US11/234,692
Publication Number

US 20070067285A1
Time in Patent Office

1,622 Days
Field of Search

704/4, 704/8, 704/9, 704/10
US Class Current

704/10
CPC Class Codes

G06F 40/295 Named entity recognition

G06Q 10/10 Office automation; Time man...
