DATA PROCESSING

US 20170286489A1
Filed: 03/30/2016
Published: 10/05/2017
Est. Priority Date: 03/30/2016
Status: Active Grant

Abstract

A method and associated system. Entities within a first data source are identified. For each entity identified within the first data source, attributes of the entity identified within the first data source and/or relationships between the entity identified within the first data source and other entities identified within the first data source are identified. The attributes and/or relationships identified within the first data source are associated with a first entity identified within a data structure. For each entity identified within the first data source, a frequency metric characterizing the entity identified within the first data source is generated. The frequency metric is based on a frequency at which each attribute and/or relationship identified within the first data source is associated with the entity identified within the first data source. A degree of similarity between two of the entities is identified by comparing the frequency metrics of the two entities.
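
The flow in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample observations, the function names, the use of relative frequency as the frequency metric, and the overlap measure used for the comparison. The abstract itself does not fix any of these choices.

```python
from collections import Counter, defaultdict

# Hypothetical (entity, relationship, attribute-or-entity) observations
# extracted from a single data source. The values are invented.
observations = [
    ("Acme Corp", "headquartered_in", "London"),
    ("Acme Corp", "employs", "J. Smith"),
    ("Acme Corp", "headquartered_in", "London"),
    ("ACME Corporation", "headquartered_in", "London"),
    ("ACME Corporation", "employs", "J. Smith"),
    ("Globex", "headquartered_in", "Paris"),
]


def frequency_metrics(observations):
    """Map each entity to the relative frequency of each
    (relationship, value) pair associated with it."""
    counts = defaultdict(Counter)
    for entity, relationship, value in observations:
        counts[entity][(relationship, value)] += 1
    metrics = {}
    for entity, counter in counts.items():
        total = sum(counter.values())
        metrics[entity] = {key: n / total for key, n in counter.items()}
    return metrics


def overlap_similarity(a, b):
    """Assumed comparison: sum of the shared frequency mass.
    1.0 means identical profiles, 0 means nothing in common."""
    return sum(min(a[key], b[key]) for key in a.keys() & b.keys())


metrics = frequency_metrics(observations)
print(overlap_similarity(metrics["Acme Corp"], metrics["ACME Corporation"]))
print(overlap_similarity(metrics["Acme Corp"], metrics["Globex"]))
```

In this toy data the two Acme spellings share most of their frequency mass while Globex shares none, which is the kind of signal the method uses to suggest that two entities may be the same.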

25 Claims

1. A method, said method comprising:
- identifying, by one or more processors of a computer system, a plurality of entities within a first data source;
  
  for each entity identified within the first data source, said one or more processors identifying within the first data source attributes of the entity identified within the first data source and/or relationships between the entity identified within the first data source and other entities identified within the first data source, and associating the attributes and/or relationships identified within the first data source with a first entity identified within a data structure;
  
  said one or more processors generating, for each entity identified within the first data source, a frequency metric characterizing the entity identified within the first data source, said frequency metric based on a frequency at which each attribute and/or relationship identified within the first data source is associated with the entity identified within the first data source; and
  
  said one or more processors identifying a degree of similarity between two entities of the plurality of entities by comparing the respective frequency metrics of the two entities.
  - 2. The method of claim 1, said method comprising:
    - said one or more processors associating the two entities within the data structure in response to a determination that an identified degree of similarity between the two entities is greater than a first predetermined threshold.
  - 3. The method of claim 1, said method comprising:
    - said one or more processors identifying one or more entities within a second data source;
      
      for each entity identified in the second data source, said one or more processors identifying within the second data source attributes and/or relationships of the entity identified within the second data source, and associating the attributes and/or entities identified in the second data source with the first entity identified within the data structure;
      
      generating, for each entity identified in the second data source, a frequency metric characterizing the entity identified in the second data source based on a frequency at which each attribute and/or relationship identified within the second data source is associated with the entity identified within the second data source;
      
      wherein a degree of similarity between an entity in the first data source and an entity in the second data source is identified by comparing the respective frequency metrics of the two entities.
  - 4. The method according to claim 1, wherein the frequency metric characterizing the entity identified within the first data source represents a degree of association between the entity identified within the first data source and the attributes and/or relationships identified within the first data source.
  - 5. The method of claim 4, wherein the frequency metric characterizing the entity identified within the first data source is based on triples describing the association between the entity identified within the first data source and attributes and/or relationships associated with the entity identified within the first data source.
  - 6. The method of claim 5, wherein the frequency metric characterizing the entity identified within the first data source is a term-frequency, inverse-document-frequency (TF-IDF) metric modified to handle the triples.
  - 7. The method of claim 1, wherein said identifying the degree of similarity between the two entities comprises using a cosine distance computation between the respective frequency metrics of the two entities.
  - 8. The method of claim 1, wherein said identifying the plurality of entities within the first data source comprises defining a set of entities to be searched for in the first data source.
  - 9. The method of claim 1, wherein said identifying attributes of an entity identified within the first data source comprises decomposing text of the data source into an entity, relationship and attribute triple, wherein the relationship is the relationship between the entity identified within the first data source and the attribute, or between the entity identified within the first data source and another entity identified within the first data source.
  - 10. The method of claim 1, said method comprising:
    - said one or more processors providing a facility for a user to confirm an association between the two entities, or between an entity identified within the first data source and an attribute identified within the first data source.
  - 11. The method of claim 1, said method comprising:
    - said one or more processors providing a facility for a user to remove an association between the two entities, or between an entity identified within the first data source and an attribute of the entity identified within the first data source.
  - 12. The method of claim 1, said method comprising:
    - said one or more processors providing a facility for a user to manually associate an attribute with an entity identified within the first data source.
  - 13. The method of claim 1, said method comprising:
    - said one or more processors providing a facility for a user to combine the two entities together in the data structure, wherein attributes of both entities of the two entities are associated with the combined two entities in the data structure.
  - 14. The method of claim 13, said method comprising:
    - said one or more processors calculating a frequency metric for the combined two entities based on a frequency at which each attribute of the combined two entities is associated with the combined two entities.
  - 15. The method of claim 1, said method comprising:
    - said one or more processors combining the two entities into a single entity in response to a determination that an identified degree of similarity between the two entities is greater than a second predetermined threshold.
  - 16. The method of claim 1, said method comprising:
    - said one or more processors associating the two entities with each other in response to a determination that an identified degree of similarity between the two entities is greater than a second predetermined threshold and the two entities have a same entity name or a similar entity name.
  - 17. The method of claim 1, wherein said identifying entities within the first data source comprises including the first data source within a natural language algorithm.
  - 18. The method of claim 1, said method comprising:
    - said one or more processors displaying a representation of the data structure to identify to a user associations between entities within the data structure.
  - 19. The method of claim 18, wherein the associations between entities within the data structure are displayed in response to a determination that the degree of similarity between the two entities is greater than a third predetermined threshold.
  - 20. The method of claim 18, wherein the representation of the data structure identifies to the user associations between the entities within the data structure and attributes of the entities within the data structure.
  - 21. The method of claim 1, said method comprising:
    - said one or more processors providing a facility for a user to manually input text data to be processed as another data source.
  - 22. The method of claim 1, said method comprising:
    - said one or more processors providing a facility for the user to apply a weighting to a first attribute of an entity identified within the first data source to influence the impact of the first attribute on the frequency metrics characterizing the entity identified within the first data source.
  - 23. The method of claim 1, wherein the first data source is a web page or document.
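
Claims 5 to 7 refer to a TF-IDF metric modified to handle triples and to a cosine computation, and claims 2 and 15 refer to predetermined thresholds. The Python sketch below shows one possible reading, not the patented implementation. It assumes that each entity plays the role of a "document" and each (relationship, value) pair the role of a "term"; the smoothed IDF formula, the function names, and both threshold values are hypothetical. It computes cosine similarity, which is one minus the cosine distance named in claim 7.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(triples):
    """Build a sparse TF-IDF vector per entity, treating each
    (relationship, value) pair as a term (an assumed adaptation)."""
    term_counts = defaultdict(Counter)
    for entity, relationship, value in triples:
        term_counts[entity][(relationship, value)] += 1

    n_entities = len(term_counts)
    doc_freq = Counter()  # number of entities each term appears with
    for counter in term_counts.values():
        doc_freq.update(counter.keys())

    vectors = {}
    for entity, counter in term_counts.items():
        total = sum(counter.values())
        vectors[entity] = {
            # Smoothed IDF so that terms shared by all entities keep
            # a non-zero weight (an assumption, not from the claims).
            term: (n / total)
            * (math.log((1 + n_entities) / (1 + doc_freq[term])) + 1.0)
            for term, n in counter.items()
        }
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical threshold values; the claims only say "predetermined".
ASSOCIATE_THRESHOLD = 0.5  # claim 2: associate the two entities
COMBINE_THRESHOLD = 0.9    # claim 15: combine into a single entity


def decide(similarity):
    """Return the action suggested by the similarity score."""
    if similarity > COMBINE_THRESHOLD:
        return "combine"
    if similarity > ASSOCIATE_THRESHOLD:
        return "associate"
    return "none"


# Invented triples for three entities A, B and C.
triples = [
    ("A", "r", "x"), ("A", "s", "y"),
    ("B", "r", "x"), ("B", "s", "y"),
    ("C", "r", "z"),
]
vectors = tfidf_vectors(triples)
print(decide(cosine_similarity(vectors["A"], vectors["B"])))
print(decide(cosine_similarity(vectors["A"], vectors["C"])))
```

Storing the vectors as dictionaries keeps them sparse, which suits data where each entity is associated with only a small share of all observed attributes and relationships.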

24. A computer program product, comprising one or more computer readable hardware storage devices having computer readable program code stored therein, said program code containing instructions executable by one or more processors of a computer system to implement a method, said method comprising:
- said one or more processors identifying a plurality of entities within a first data source;
  
  for each entity identified within the first data source, said one or more processors identifying within the first data source attributes of the entity identified within the first data source and/or relationships between the entity identified within the first data source and other entities identified within the first data source, and associating the attributes and/or relationships identified within the first data source with a first entity identified within a data structure;
  
  said one or more processors generating, for each entity identified within the first data source, a frequency metric characterizing the entity identified within the first data source, said frequency metric based on a frequency at which each attribute and/or relationship identified within the first data source is associated with the entity identified within the first data source; and
  
  said one or more processors identifying a degree of similarity between two entities of the plurality of entities by comparing the respective frequency metrics of the two entities.

25. A computer system, comprising one or more processors, one or more memories, and one or more computer readable hardware storage devices, said one or more hardware storage devices containing program code executable by the one or more processors via the one or more memories to implement a method, said method comprising:
- said one or more processors identifying a plurality of entities within a first data source;
  
  for each entity identified within the first data source, said one or more processors identifying within the first data source attributes of the entity identified within the first data source and/or relationships between the entity identified within the first data source and other entities identified within the first data source, and associating the attributes and/or relationships identified within the first data source with a first entity identified within a data structure;
  
  said one or more processors generating, for each entity identified within the first data source, a frequency metric characterizing the entity identified within the first data source, said frequency metric based on a frequency at which each attribute and/or relationship identified within the first data source is associated with the entity identified within the first data source; and
  
  said one or more processors identifying a degree of similarity between two entities of the plurality of entities by comparing the respective frequency metrics of the two entities.
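
Claims 13 and 14 describe combining two entities and recalculating the frequency metric for the combined entity, and claim 22 describes a user-applied weighting on an attribute. A minimal Python sketch of how these steps might fit together follows. The function names, the string encoding of attributes, the sample counts, and the choice to apply the weight as a multiplier before normalizing are all assumptions made for illustration.

```python
from collections import Counter


def combine_entities(counts_a, counts_b):
    """Merge the attribute counts of two entities so the combined
    entity carries the attributes of both."""
    return Counter(counts_a) + Counter(counts_b)


def weighted_frequency_metric(counts, weights=None):
    """Relative frequency of each attribute, optionally scaled by
    user-supplied weights (default weight 1.0) before normalizing."""
    weights = weights or {}
    weighted = {attr: n * weights.get(attr, 1.0) for attr, n in counts.items()}
    total = sum(weighted.values())
    if total == 0:
        return {}
    return {attr: w / total for attr, w in weighted.items()}


# Invented attribute counts for two entities judged to be the same.
entity_a = Counter({"hq:London": 2, "employs:Smith": 1})
entity_b = Counter({"hq:London": 1, "sector:Retail": 1})

merged = combine_entities(entity_a, entity_b)
print(weighted_frequency_metric(merged))
# A user doubles the influence of the headquarters attribute.
print(weighted_frequency_metric(merged, {"hq:London": 2.0}))
```

Recomputing the metric from merged counts, rather than averaging the two earlier metrics, keeps the result consistent with how the metric is generated for any other entity.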

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Dantressangle, Patrick; Laws, Simon; Ronaghan, Stacey H.; Wooldridge, Peter

Granted Patent

US 10,585,893 B2

CPC Class Codes

G06F 16/22 Indexing; Data structures t...

G06F 16/2455 Query execution
