Automatic discovery of new entities using graph reconciliation

US 9,785,696 B1
Filed: 10/22/2013
Issued: 10/10/2017
Est. Priority Date: 10/04/2013
Status: Expired due to Fees

2 Assignments

0 Petitions

Abstract

Systems and methods can identify potential entities from facts generated from web-based sources. For example, a method may include generating a source data graph for a potential entity from a text document in which the potential entity is identified. The source data graph represents the potential entity and facts about the potential entity from the text document. The method may also include clustering a plurality of source data graphs, each for a different text document, by entity name and type, wherein at least one cluster includes the potential entity. The method may also include verifying the potential entity using the cluster by corroborating at least a quantity of determinative facts about the potential entity and storing the potential entity and the facts about the potential entity, wherein each stored fact has at least one associated text document.
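
As a rough illustration of the clustering step described in the abstract, the sketch below groups source data graphs by entity name and type. The `SourceGraph` class, its field names, the case-insensitive name matching, and the sample data are all assumptions made for illustration; the patent text here does not prescribe a representation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceGraph:
    """One potential entity as described by a single document (hypothetical layout)."""
    domain: str        # domain hosting the source document
    entity_name: str
    entity_type: str
    facts: frozenset   # set of (relationship, object) pairs about the entity


def cluster_by_name_and_type(graphs):
    """Group source graphs that share an entity name and an entity type."""
    clusters = defaultdict(list)
    for g in graphs:
        # Assumption: names are compared case-insensitively.
        clusters[(g.entity_name.lower(), g.entity_type)].append(g)
    return dict(clusters)


# Hypothetical sample data: two documents agree on a person, one describes a band.
graphs = [
    SourceGraph("a.example", "Jane Doe", "person", frozenset({("born_in", "1970")})),
    SourceGraph("b.example", "Jane Doe", "person", frozenset({("born_in", "1970")})),
    SourceGraph("c.example", "Jane Doe", "band", frozenset({("formed_in", "2001")})),
]
clusters = cluster_by_name_and_type(graphs)
```

Here the two "person" graphs land in one cluster and the "band" graph in another, since the key combines name and type.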

41 Citations

24 Claims

1. A computer system comprising:
- memory storing:
  - a target data graph, and
  - a plurality of source data graphs generated from analysis of source documents, where each source data graph:
    - is associated with a source document having an associated domain,
    - has a source entity that does not exist in the target data graph but has an entity type that exists in the target data graph, and
    - includes fact tuples, where a fact tuple identifies:
      - a subject entity,
      - a relationship connecting the subject entity to an object entity, wherein the relationship is associated with the entity type of the subject entity in the target data graph, and
      - the object entity;
- at least one processor; and
- memory storing instructions that, when executed by the at least one processor, cause the computer system to perform operations including:
  - generating a cluster of source data graphs, the cluster including source data graphs associated with a first source entity of a first source entity type that shares at least two fact tuples that have the first source entity as the subject entity and a determinative relationship as the relationship connecting the subject entity to the object entity,
  - iteratively splitting the source data graphs in the cluster into a plurality of buckets of the source data graphs based on a plurality of fact tuples, wherein each iteration operates on a distinct determinative relationship and each of the buckets for the iteration includes source data graphs that share a value for the fact tuple,
  - discarding one or more buckets from the cluster that are associated with less than a minimum number of domains represented by the source data graphs in the cluster,
  - generating a reconciled graph by merging the source data graphs remaining in the cluster when the source data graphs meet a similarity threshold, and
  - generating a suggested new entity and entity relationships for the target data graph based on the reconciled graph.
- Dependent claims (2, 3, 4, 5, 6, 7, 24):
  - 2. The system of claim 1, wherein iteratively splitting the cluster into a plurality of buckets includes:
    - splitting the cluster into first buckets based on a first of the plurality of fact tuples so that source data graphs sharing the first of the plurality of fact tuples are in a same first bucket; and
    - splitting, on the next iteration, the first buckets into second buckets based on a second of the plurality of fact tuples so that source data graphs sharing the second of the plurality of fact tuples are in a same second bucket.
  - 3. The system of claim 1, wherein the minimum number depends on the number of domains represented by the source documents.
  - 4. The system of claim 1, wherein the similarity threshold is based on a quantity of fact tuples for a first source data graph in the cluster that match fact tuples for a second source data graph in the cluster.
  - 5. The system of claim 1, wherein the reconciled graph is a first reconciled graph and the memory stores instructions that, when executed by the at least one processor, causes the computer system to further perform operations including:
    - generating a second reconciled graph from a second cluster for source data graphs associated with second subject entities of a second source entity type that share an entity name and that share object entities for at least two determinative relationships;
    - determining that the first source entity type and the second source entity type are the same and that the first reconciled graph and the second reconciled graph meet the similarity threshold; and
    - generating a third reconciled graph by merging the first reconciled graph and the second reconciled graph, wherein the suggested new entity and entity relationships are based on the third reconciled graph.
  - 6. The system of claim 1, wherein the memory stores instructions that, when executed by the at least one processor, causes the computer system to further perform operations including:
    - discarding fact tuples representing conflicting facts from a particular source data graph prior to generating the cluster.
  - 7. The system of claim 1, wherein the memory stores instructions that, when executed by the at least one processor, causes the computer system to further perform operations including:
    - generating inverse fact tuples for a particular source data graph prior to generating the cluster, the inverse fact tuples being a subset of the fact tuples included in the particular source data graph.
  - 24. The system of claim 1, wherein each source data graph in the plurality of source data graphs includes at least one fact tuple having an object entity that exists in the target data graph.
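
The splitting and discarding operations of claims 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: graphs are plain dicts, each relationship has a single object value, graphs that lack the relationship being split on drop out, and under-supported buckets are discarded after every iteration (claim 1 does not fix the point at which discarding happens). The function name, relationship names, and sample data are hypothetical.

```python
from collections import defaultdict


def split_into_buckets(cluster, determinative_relationships, min_domains):
    """Iteratively split a cluster, one determinative relationship per iteration.

    Each graph is a dict: {"domain": str, "facts": {relationship: object_value}}.
    """
    buckets = [cluster]
    for relationship in determinative_relationships:
        next_buckets = []
        for bucket in buckets:
            by_value = defaultdict(list)
            for graph in bucket:
                value = graph["facts"].get(relationship)
                if value is not None:  # assumption: graphs lacking the fact drop out
                    by_value[value].append(graph)
            next_buckets.extend(by_value.values())
        # Discard buckets backed by fewer than min_domains distinct domains.
        buckets = [
            b for b in next_buckets
            if len({g["domain"] for g in b}) >= min_domains
        ]
    return buckets


# Hypothetical cluster for one entity name and type.
cluster = [
    {"domain": "a.example", "facts": {"born_in": "1970", "born_at": "Paris"}},
    {"domain": "b.example", "facts": {"born_in": "1970", "born_at": "Paris"}},
    {"domain": "c.example", "facts": {"born_in": "1985", "born_at": "Lyon"}},
    {"domain": "a.example", "facts": {"born_in": "1970", "born_at": "Oslo"}},
]
buckets = split_into_buckets(cluster, ["born_in", "born_at"], min_domains=2)
```

With `min_domains=2`, the "1985" bucket is dropped after the first iteration and the "Oslo" bucket after the second, leaving a single bucket whose graphs come from two different domains.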

8. A method comprising:
- generating, using at least one processor, a plurality of source data graphs, each source data graph being associated with a source document hosted at a domain and a source entity, the source data graph including tuples for the source entity, each tuple having a predicate linking the source entity to an object entity, wherein a target data graph lacks the source entity but an entity type of the source entity exists in the target data graph;
- clustering, using the at least one processor, source data graphs by source entity name and source entity type;
- dividing, using the at least one processor, the cluster into buckets of the source data graphs based on a plurality of fact tuples, wherein each of the plurality of fact tuples includes a subject entity name matching the source entity name and a distinct determinative relationship, where each bucket includes source data graphs that share the source entity name, the distinct determinative relationship, and an object entity for the distinct determinative relationship; and
- generating a candidate entity for the target data graph from at least some of the buckets, the candidate entity being from a first bucket that is associated with a minimum number of domains represented by the source data graphs in the first bucket.
- Dependent claims (9, 10, 11, 12, 13, 14, 15):
  - 9. The method of claim 8, wherein dividing the cluster into buckets includes:
    - selecting a determinative predicate from the tuples in the cluster;
    - generating a bucket for unique object entities for the determinative predicate, so that source data graphs with a first object entity for the determinative predicate are placed in a first bucket and source data graphs with a second object entity for the determinative predicate are placed in a second bucket; and
    - repeating the selecting and generating on the generated buckets a quantity of times.
  - 10. The method of claim 9, wherein dividing the cluster into buckets includes:
    - discarding buckets with less than the minimum number of different domains prior to repeating the selecting and generating.
  - 11. The method of claim 10, wherein each source data graph includes at least one tuple with an object entity that exists in the target data graph.
  - 12. The method of claim 10, wherein the minimum number depends on the number of domains represented by the source documents.
  - 13. The method of claim 9, wherein after the dividing, for each bucket not discarded the method further comprises:
    - generating a reconciled graph from source data graphs in the bucket; and
    - generating the candidate entity using the reconciled graph.
  - 14. The method of claim 13, the method further comprising:
    - determining that a first source entity type of a first reconciled graph and a second source entity type of a second reconciled graph are the same and that the first reconciled graph and the second reconciled graph meet a similarity threshold; and
    - generating a third reconciled graph by merging the first reconciled graph and the second reconciled graph, wherein the candidate entity is based on the third reconciled graph.
  - 15. The method of claim 13, further comprising:
    - discarding fact tuples representing conflicting facts from the reconciled graphs prior to generating the candidate entity.
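
Claims 13 to 15 describe building a reconciled graph from a bucket and discarding conflicting facts. The sketch below is one hedged reading of those steps. It assumes a graph is a set of `(predicate, object)` pairs, that similarity is the count of shared pairs checked pairwise, and that a "conflict" means a predicate treated as single-valued appearing with more than one object. The claims shown here do not define conflicts, so that rule, the function names, and the sample facts are hypothetical.

```python
def matching_facts(a, b):
    """Count fact tuples present in both graphs (sets of (predicate, object))."""
    return len(a & b)


def reconcile(graphs, min_matches):
    """Merge graphs into one reconciled graph, or return None if any pair
    shares fewer than min_matches fact tuples (assumed similarity rule)."""
    for i, a in enumerate(graphs):
        for b in graphs[i + 1:]:
            if matching_facts(a, b) < min_matches:
                return None
    return set().union(*graphs)


def drop_conflicts(graph, single_valued):
    """Remove facts whose predicate is assumed single-valued but has several objects."""
    objects = {}
    for predicate, obj in graph:
        objects.setdefault(predicate, set()).add(obj)
    return {
        (p, o) for p, o in graph
        if not (p in single_valued and len(objects[p]) > 1)
    }


# Hypothetical graphs from one bucket: they agree on two facts, disagree on one.
g1 = {("born_in", "1970"), ("born_at", "Paris"), ("height_cm", "180")}
g2 = {("born_in", "1970"), ("born_at", "Paris"), ("height_cm", "175")}

reconciled = reconcile([g1, g2], min_matches=2)
cleaned = drop_conflicts(reconciled, single_valued={"height_cm"})
```

The same `reconcile` call could be applied to two reconciled graphs of the same entity type to produce a third, in the spirit of claim 14.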

16. A method of automatically identifying new entities for a target data graph comprising:
- generating, for each of a plurality of text documents, a source data graph for a potential entity from the text document in which the potential entity is identified, the source data graph representing the potential entity and facts about the potential entity from the text document and the text document being hosted at a domain, wherein the target data graph lacks the potential entity but an entity type of the potential entity exists in the target data graph;
- clustering source data graphs by entity name and type, wherein at least one cluster is for the potential entity;
- iteratively dividing the cluster into a plurality of buckets of source data graphs based on a plurality of fact tuples, wherein each of the plurality of fact tuples includes a subject entity matching the potential entity and a distinct determinative relationship;
- verifying the potential entity using the cluster by:
  - corroborating at least a quantity of determinative facts about the potential entity, and
  - determining the potential entity is included in a bucket associated with a minimum number of different domains represented by the source data graphs in the bucket after iteratively dividing the cluster; and
- storing the potential entity and the facts about the potential entity.
- Dependent claims (17, 18, 19, 20, 21, 22, 23):
  - 17. The method of claim 16, further comprising:
    - generating information that is presented to a user using the stored potential entity and the stored facts;
    - receiving input from the user indicating the potential entity and the stored facts are acceptable; and
    - generating a new entity in a target data graph representing the potential entity; and
    - generating relationships in the target data graph for the new entity that represent the stored facts.
  - 18. The method of claim 16, wherein buckets lacking the minimum number of different domains are discarded during the iterative dividing.
  - 19. The method of claim 16, further comprising:
    - removing conflicting facts from the facts about the potential entity prior to storing the potential entity and the facts.
  - 20. The method of claim 16, wherein the minimum number of different domains depends on the number of domains represented by the text documents.
  - 21. The method of claim 16, wherein generating the source data graph includes:
    - generating a plurality of facts from each of the plurality of text documents, the plurality of facts including the facts about the potential entity; and
    - clustering the facts by text document, entity type, and entity name.
  - 22. The method of claim 16, wherein verifying the potential entity further includes determining that at least one of the determinative facts has an object that exists in a target data graph.
  - 23. The method of claim 16, wherein each source data graph includes at least one fact with an object entity that exists in the target data graph.
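
Claims 21 and 23 suggest how source data graphs might be assembled from extracted facts. The following sketch groups facts by document, entity type, and entity name, then keeps only graphs for entities missing from the target graph that have at least one object the target graph already contains. Representing the target graph as a set of entity names, the tuple layout, and all sample values are simplifying assumptions.

```python
from collections import defaultdict


def build_source_graphs(facts, target_entities):
    """Group extracted facts into one source graph per (document, type, name).

    Each fact is a tuple: (document_url, entity_type, entity_name, predicate, obj).
    target_entities is a set of entity names standing in for the target data graph.
    """
    grouped = defaultdict(set)
    for doc, etype, name, predicate, obj in facts:
        grouped[(doc, etype, name)].add((predicate, obj))
    # Keep graphs whose entity is absent from the target graph and that are
    # anchored by at least one object the target graph already knows.
    return {
        key: tuples
        for key, tuples in grouped.items()
        if key[2] not in target_entities
        and any(obj in target_entities for _, obj in tuples)
    }


# Hypothetical extracted facts from two documents.
facts = [
    ("https://a.example/p1", "person", "Jane Doe", "born_at", "Paris"),
    ("https://a.example/p1", "person", "Jane Doe", "born_in", "1970"),
    ("https://b.example/x", "person", "Jane Doe", "born_at", "Paris"),
    ("https://b.example/x", "person", "John Roe", "born_at", "Atlantis"),
    ("https://b.example/x", "city", "Paris", "located_in", "France"),
]
source_graphs = build_source_graphs(facts, target_entities={"Paris", "France"})
```

In this sample, "John Roe" is dropped because none of his objects exist in the target set, and "Paris" is dropped because it is already a known entity, leaving one "Jane Doe" graph per document.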

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Yakhnenko, Oksana; Vesdapunt, Norases
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): Tessema, Aida
Application Number: US14/059,830
Time in Patent Office: 1,449 Days
Field of Search: None

US Class Current
CPC Class Codes
- G06F 16/285: Clustering or classification
- G06F 16/288: Entity relationship models
- G06F 16/30: of unstructured textual dat...
- G06F 16/9024: Graphs; Linked lists G06F16...
- G06N 5/022: Knowledge engineering; Know...
