CONFIDENCE LINKS BETWEEN NAME ENTITIES IN DISPARATE DOCUMENTS
Abstract
The invention relates to cross-document entity co-reference systems in which naturally occurring entity mentions in a document corpus are analyzed and transformed into name clusters that represent global entities. In a first aspect of the invention, a name variation module analyzes naturally occurring names of entities extracted from the document corpus and provides an initial set of equivalent names that could refer to the same real-world entity. In a second aspect of the invention, a disambiguation module takes the initial set of equivalent names and uses an agglomerative clustering algorithm to disambiguate the potentially co-referent named entities.
24 Claims
1. A system that detects similarities between name strings in a document set, comprising:
a preprocessing module configured to:
extract a plurality of name strings from the document set;
a matching module configured to:
detect possible matching pairs from the plurality of name strings, and
assign a plurality of similarity scores to each of the possible matching pairs using a plurality of algorithms; and
a generation module configured to:
generate a set of equivalent names by accumulating name strings from the possible matching pairs based on a comparison between the similarity scores and a threshold.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9

10. A method that detects similarities between name strings in a document set, comprising:
assigning a plurality of similarity scores to each of a plurality of possible matching pairs of the name strings using a plurality of algorithms; and
generating a set of equivalent names by accumulating name strings from the possible matching pairs based on a comparison between the similarity scores and a threshold.
Dependent claims: 11, 12, 13, 14

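The scoring and accumulation steps recited in claim 10 can be illustrated with a minimal Python sketch. The claim does not name the algorithms, how the several scores are compared with the threshold, or the threshold value, so everything below is an assumption made for illustration: two stand-in algorithms (a character-level ratio from the standard library's difflib and a token-level Jaccard overlap), every pair of names treated as a possible matching pair, a pair accepted when its best score reaches the threshold, and the hypothetical function names themselves.

```python
from difflib import SequenceMatcher
from itertools import combinations


def char_similarity(a, b):
    """Character-level similarity in [0, 1] (assumed algorithm #1)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Token-level Jaccard overlap in [0, 1] (assumed algorithm #2)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical choice of "a plurality of algorithms".
ALGORITHMS = (char_similarity, token_jaccard)


def score_pairs(names):
    """Assign one score per algorithm to every candidate pair."""
    return {
        (a, b): [algorithm(a, b) for algorithm in ALGORITHMS]
        for a, b in combinations(names, 2)
    }


def equivalent_names(target, names, threshold=0.6):
    """Accumulate names paired with `target` whose best score meets the threshold."""
    equivalents = {target}
    for (a, b), scores in score_pairs(names).items():
        # Assumed comparison rule: the highest of the scores decides.
        if target in (a, b) and max(scores) >= threshold:
            equivalents.update((a, b))
    return equivalents


names = ["John A. Smith", "John Smith", "Mary Jones"]
print(equivalent_names("John Smith", names))
```

Taking the maximum score is only one possible reading of "a comparison between the similarity scores and a threshold"; an average, a vote, or per-algorithm thresholds would fit the same claim language.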
15. A method for determining a set of equivalent names for an input name string in a document set, comprising:
retrieving from the document set a plurality of name strings that each include tokens from the input name string;
generating a token-subset tree from the input name string and the plurality of name strings, wherein the tree comprises a set of nodes and a set of edges, and wherein each node corresponds to one of the name strings, and each edge indicates a matching relationship between a pair of nodes in the tree;
determining an ambiguity score for each node in the tree; and
generating a set of equivalent names by selectively accumulating the name strings corresponding to ancestral or descendant nodes of the input name string based on the ambiguity scores.
Dependent claims: 16, 17, 18

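One way to picture the token-subset tree of claim 15 is the Python sketch below. The claim does not define the edge relationship, the ambiguity score, or the selection rule, so the sketch assumes all three: a name is a parent of another when its token set is a strict subset with no other name in between, the ambiguity score of a node is simply its number of direct children, and ancestors or descendants are accumulated only through nodes whose score does not exceed a hypothetical `max_ambiguity` limit. The function names are likewise illustrative.

```python
def tokens(name):
    """Lower-cased whitespace tokens of a name string."""
    return frozenset(name.lower().split())


def build_children(names):
    """Map each name to the names that directly extend its token set."""
    children = {name: [] for name in names}
    for child in names:
        candidates = [p for p in names if tokens(p) < tokens(child)]
        for parent in candidates:
            # Direct parent: no other candidate lies strictly between the two.
            if not any(tokens(parent) < tokens(mid) for mid in candidates):
                children[parent].append(child)
    return children


def equivalent_names(input_name, names, max_ambiguity=1):
    """Accumulate ancestors and descendants reached through unambiguous nodes."""
    children = build_children(names)
    # Assumed ambiguity score: the number of direct children of a node.
    ambiguity = {name: len(kids) for name, kids in children.items()}
    parents = {n: [p for p, kids in children.items() if n in kids] for n in names}

    result = {input_name}
    stack = list(parents[input_name])  # walk upward through ancestors
    while stack:
        node = stack.pop()
        if node not in result and ambiguity[node] <= max_ambiguity:
            result.add(node)
            stack.extend(parents[node])

    stack = [input_name]  # walk downward through descendants
    while stack:
        node = stack.pop()
        if ambiguity[node] <= max_ambiguity:
            for kid in children[node]:
                if kid not in result:
                    result.add(kid)
                    stack.append(kid)
    return result


names = ["Smith", "John Smith", "John A. Smith", "Mary Smith"]
print(equivalent_names("John Smith", names))
```

In this example "Smith" is extended by two different names, so under the assumed score it is too ambiguous to be accumulated for "John Smith", while "John A. Smith" is reached as a descendant.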
19. A system that detects a cluster of name strings that refer to the same global entity, comprising:
a preprocessing module configured to:
retrieve from a document set name strings for which entity clusters are to be created, and
receive an initial set of equivalent names for each of the name strings; and
a disambiguation module configured to, for each name string:
split the set of equivalent names into subsets of singleton clusters, each singleton cluster representing a potentially unique global entity, and
iteratively merge the singleton clusters into one or more global entity clusters by matching features associated with the singleton clusters and the global entity clusters.
Dependent claims: 20, 21

22. A method that detects a cluster of name strings that refer to the same global entity, comprising:
receiving an initial set of equivalent names for name strings for which entity clusters are to be created;
splitting the set of equivalent names into subsets of singleton clusters, each singleton cluster representing a potentially unique global entity; and
iteratively merging the singleton clusters into one or more global entity clusters by matching features associated with the singleton clusters and the global entity clusters.
Dependent claims: 23, 24
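
The splitting and iterative merging recited in claim 22 follow the general shape of agglomerative clustering, which the abstract names. The Python sketch below is an illustration under stated assumptions, not the claimed method: each name mention is assumed to carry a set of context features, feature matching is assumed to be Jaccard overlap, a merged cluster is assumed to pool the features of its members, and merging stops once the best overlap falls below a hypothetical threshold.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two feature sets in [0, 1] (assumed matching measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_mentions(mentions, threshold=0.3):
    """Greedy agglomerative clustering over {mention_id: feature_set}."""
    # Start with one singleton cluster per mention: (members, features).
    clusters = [({mention}, set(features)) for mention, features in mentions.items()]
    while len(clusters) > 1:
        # Find the pair of clusters whose features match best.
        (i, j), best = max(
            (((i, j), jaccard(clusters[i][1], clusters[j][1]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [members for members, _ in clusters]


# Hypothetical mentions and features, for illustration only.
mentions = {
    "doc1:John Smith": {"boston", "senator"},
    "doc2:J. Smith": {"boston", "senator", "vote"},
    "doc3:John Smith": {"guitar", "album"},
}
print(cluster_mentions(mentions))
```

Here the first two mentions share most of their features and merge into one cluster, while the third shares none and remains a separate potential entity.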
Specification