Automatic discovery of new entities using graph reconciliation
Abstract
Systems and methods can identify potential entities from facts generated from web-based sources. For example, a method may include generating a source data graph for a potential entity from a text document in which the potential entity is identified. The source data graph represents the potential entity and facts about the potential entity from the text document. The method may also include clustering a plurality of source data graphs, each for a different text document, by entity name and type, wherein at least one cluster includes the potential entity. The method may also include verifying the potential entity using the cluster by corroborating at least a quantity of determinative facts about the potential entity and storing the potential entity and the facts about the potential entity, wherein each stored fact has at least one associated text document.
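The clustering step described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Fact` and `SourceGraph` classes, the field names, and the case-insensitive name match are all assumptions introduced here for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """Hypothetical fact tuple: subject, relationship, object."""
    subject: str
    relationship: str
    obj: str


@dataclass
class SourceGraph:
    """Hypothetical source data graph built from one text document."""
    entity_name: str
    entity_type: str
    domain: str  # domain hosting the source document
    facts: list = field(default_factory=list)


def cluster_by_name_and_type(graphs):
    """Group source graphs that share an entity name and entity type."""
    clusters = defaultdict(list)
    for g in graphs:
        # Lower-casing the name is an assumption; the source does not say
        # how names are normalized before comparison.
        clusters[(g.entity_name.lower(), g.entity_type)].append(g)
    return dict(clusters)


graphs = [
    SourceGraph("Jane Roe", "Person", "a.example",
                [Fact("Jane Roe", "born_in", "1970")]),
    SourceGraph("Jane Roe", "Person", "b.example",
                [Fact("Jane Roe", "born_in", "1970")]),
    SourceGraph("Jane Roe", "Band", "c.example",
                [Fact("Jane Roe", "formed_in", "2001")]),
]
clusters = cluster_by_name_and_type(graphs)
for key, members in clusters.items():
    print(key, [g.domain for g in members])
```

Keying on both name and type keeps a person and a band with the same name in separate clusters, so later corroboration only compares graphs that could plausibly describe the same entity.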
24 Claims
1. A computer system comprising:
- memory storing:
  - a target data graph, and
  - a plurality of source data graphs generated from analysis of source documents, where each source data graph:
    - is associated with a source document having an associated domain,
    - has a source entity that does not exist in the target data graph but has an entity type that exists in the target data graph, and
    - includes fact tuples, where a fact tuple identifies:
      - a subject entity,
      - a relationship connecting the subject entity to an object entity, wherein the relationship is associated with the entity type of the subject entity in the target data graph, and
      - the object entity;
- at least one processor; and
- memory storing instructions that, when executed by the at least one processor, cause the computer system to perform operations including:
  - generating a cluster of source data graphs, the cluster including source data graphs associated with a first source entity of a first source entity type that shares at least two fact tuples that have the first source entity as the subject entity and a determinative relationship as the relationship connecting the subject entity to the object entity,
  - iteratively splitting the source data graphs in the cluster into a plurality of buckets of the source data graphs based on a plurality of fact tuples, wherein each iteration operates on a distinct determinative relationship and each of the buckets for the iteration includes source data graphs that share a value for the fact tuple,
  - discarding one or more buckets from the cluster that are associated with less than a minimum number of domains represented by the source data graphs in the cluster,
  - generating a reconciled graph by merging the source data graphs remaining in the cluster when the source data graphs meet a similarity threshold, and
  - generating a suggested new entity and entity relationships for the target data graph based on the reconciled graph.

Dependent claims: 2, 3, 4, 5, 6, 7, 24
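The splitting, discarding, and merging operations of claim 1 might look roughly like the sketch below. Several choices here are assumptions that the claim does not specify: graphs are plain dictionaries with one object value per relationship, graphs lacking the relationship under test are dropped, under-supported buckets are discarded after every iteration, and Jaccard overlap of fact sets stands in for the unspecified similarity threshold.

```python
from collections import defaultdict


def split_and_filter(cluster, determinative_relationships, min_domains):
    """Split a cluster into buckets, one determinative relationship per pass."""
    buckets = [cluster]
    for rel in determinative_relationships:
        next_buckets = []
        for bucket in buckets:
            by_value = defaultdict(list)
            for g in bucket:
                value = g["facts"].get(rel)
                if value is not None:  # assumption: skip graphs lacking rel
                    by_value[value].append(g)
            next_buckets.extend(by_value.values())
        # Discard buckets backed by too few distinct domains.
        buckets = [b for b in next_buckets
                   if len({g["domain"] for g in b}) >= min_domains]
    return buckets


def jaccard(a, b):
    """Overlap of two graphs' fact sets (a stand-in similarity measure)."""
    sa, sb = set(a["facts"].items()), set(b["facts"].items())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def reconcile(bucket, threshold):
    """Merge a bucket's facts if every graph is similar enough to the first."""
    first = bucket[0]
    if not all(jaccard(first, g) >= threshold for g in bucket[1:]):
        return None
    merged = {}
    for g in bucket:
        for rel, val in g["facts"].items():
            merged.setdefault(rel, set()).add(val)
    return merged


cluster = [
    {"domain": "a.example",
     "facts": {"born_in": "1970", "spouse": "Sam Roe", "occupation": "chemist"}},
    {"domain": "b.example",
     "facts": {"born_in": "1970", "spouse": "Sam Roe"}},
    {"domain": "c.example",
     "facts": {"born_in": "1985", "spouse": "Lee Poe"}},
]
buckets = split_and_filter(cluster, ["born_in", "spouse"], min_domains=2)
reconciled = reconcile(buckets[0], threshold=0.5)
print(len(buckets), reconciled)
```

In this toy data the graph from `c.example` disagrees on the birth year, lands in a single-domain bucket, and is discarded; the two remaining graphs merge into one reconciled fact set that also carries the occupation reported by only one of them.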
8. A method comprising:
- generating, using at least one processor, a plurality of source data graphs, each source data graph being associated with a source document hosted at a domain and a source entity, the source data graph including tuples for the source entity, each tuple having a predicate linking the source entity to an object entity, wherein a target data graph lacks the source entity but an entity type of the source entity exists in the target data graph;
- clustering, using the at least one processor, source data graphs by source entity name and source entity type;
- dividing, using the at least one processor, the cluster into buckets of the source data graphs based on a plurality of fact tuples, wherein each of the plurality of fact tuples includes a subject entity name matching the source entity name and a distinct determinative relationship, where each bucket includes source data graphs that share the source entity name, the distinct determinative relationship, and an object entity for the distinct determinative relationship; and
- generating a candidate entity for the target data graph from at least some of the buckets, the candidate entity being from a first bucket that is associated with a minimum number of domains represented by the source data graphs in the first bucket.

Dependent claims: 9, 10, 11, 12, 13, 14, 15
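Claim 8 describes buckets that share an entity name, a determinative relationship, and an object entity. One way to picture that is a composite key, as in the hedged sketch below. The tuple layout, the function names, and the simplification of tracking only the supporting domains per bucket (rather than whole graphs) are assumptions made for illustration.

```python
from collections import defaultdict


def bucket_by_fact(graphs, entity_name, determinative):
    """Map (name, relationship, object) to the set of domains asserting it."""
    buckets = defaultdict(set)
    for domain, tuples in graphs:
        for subj, rel, obj in tuples:
            if subj == entity_name and rel in determinative:
                buckets[(subj, rel, obj)].add(domain)
    return buckets


def candidates(buckets, min_domains):
    """Keep bucket keys supported by at least min_domains distinct domains."""
    return sorted(key for key, domains in buckets.items()
                  if len(domains) >= min_domains)


# Each entry: (domain hosting the document, fact tuples from that document).
graphs = [
    ("a.example", [("Acme Trio", "formed_in", "2001"),
                   ("Acme Trio", "genre", "jazz")]),
    ("b.example", [("Acme Trio", "formed_in", "2001")]),
    ("b.example", [("Acme Trio", "formed_in", "1999")]),
]
buckets = bucket_by_fact(graphs, "Acme Trio", {"formed_in"})
result = candidates(buckets, min_domains=2)
print(result)
```

Counting distinct domains rather than documents means two pages on the same site cannot corroborate each other, which matches the claim's emphasis on the number of domains represented in a bucket.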
16. A method of automatically identifying new entities for a target data graph comprising:
- generating, for each of a plurality of text documents, a source data graph for a potential entity from the text document in which the potential entity is identified, the source data graph representing the potential entity and facts about the potential entity from the text document and the text document being hosted at a domain, wherein the target data graph lacks the potential entity but an entity type of the potential entity exists in the target data graph;
- clustering source data graphs by entity name and type, wherein at least one cluster is for the potential entity;
- iteratively dividing the cluster into a plurality of buckets of source data graphs based on a plurality of fact tuples, wherein each of the plurality of fact tuples includes a subject entity matching the potential entity and a distinct determinative relationship;
- verifying the potential entity using the cluster by:
  - corroborating at least a quantity of determinative facts about the potential entity, and
  - determining the potential entity is included in a bucket associated with a minimum number of different domains represented by the source data graphs in the bucket after iteratively dividing the cluster; and
- storing the potential entity and the facts about the potential entity.

Dependent claims: 17, 18, 19, 20, 21, 22, 23
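The two verification conditions in claim 16 can be sketched as a single check on a bucket. This is an illustrative reading only: the dictionary layout, the `verify` name, and the rule that a fact counts as corroborated when at least two different domains state it are assumptions, since the claim does not define how corroboration is measured.

```python
def verify(bucket, determinative, min_facts, min_domains):
    """Return True if the bucket passes both domain and fact corroboration."""
    # Condition 1: the bucket must span enough different domains.
    domains = {g["domain"] for g in bucket}
    if len(domains) < min_domains:
        return False

    # Condition 2: enough determinative facts must be corroborated.
    support = {}
    for g in bucket:
        for rel, val in g["facts"].items():
            if rel in determinative:
                support.setdefault((rel, val), set()).add(g["domain"])
    # Assumption: a fact is corroborated when two or more domains state it.
    corroborated = sum(1 for d in support.values() if len(d) >= 2)
    return corroborated >= min_facts


bucket = [
    {"domain": "a.example", "facts": {"born_in": "1970", "spouse": "Sam Roe"}},
    {"domain": "b.example", "facts": {"born_in": "1970", "spouse": "Sam Roe"}},
]
ok = verify(bucket, {"born_in", "spouse"}, min_facts=2, min_domains=2)
single = verify(bucket[:1], {"born_in", "spouse"}, min_facts=2, min_domains=2)
print(ok, single)
```

Only an entity that passes both checks would then be stored together with its facts.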
Specification