Techniques for similarity analysis and data enrichment using knowledge sources

US 10,210,246 B2
Filed: 09/24/2015
Issued: 02/19/2019
Est. Priority Date: 09/26/2014
Status: Active Grant

Abstract

The present disclosure relates to performing similarity metric analysis and data enrichment using knowledge sources. A data enrichment service can compare an input data set to reference data sets stored in a knowledge source to identify similarly related data. A similarity metric can be calculated corresponding to the semantic similarity of two or more data sets. The similarity metric can be used to identify data sets based on their metadata attributes and data values, enabling easier indexing and high-performance retrieval of data values. An input data set can be labeled with a category based on the data set having the best match with the input data set. The similarity of an input data set with a data set provided by a knowledge source can be used to query a knowledge source to obtain additional information about the data set. The additional information can be used to provide recommendations to the user.
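
As an illustration of how such a similarity metric could drive category labeling, the following is a minimal Python sketch, not the patented implementation. The function names, the log-based size discount, and the `TYPE_PENALTY` values are all hypothetical. The text reproduced here gives no concrete formula, and normalizing the intersection by the input set's cardinality is an assumption.

```python
import math

# Hypothetical penalties per reference-set type; the source gives no values.
TYPE_PENALTY = {"curated": 0.0, "uncurated": 0.2}


def matching_score(input_values, reference_values, reference_type):
    """Overlap score reduced by an assumed size factor and an assumed type factor."""
    inp = set(input_values)
    ref = set(reference_values)
    if not inp or not ref:
        return 0.0
    # Cardinality of the intersection, normalized by the input cardinality (assumption).
    overlap = len(inp & ref) / len(inp)
    # Assumed size reduction: larger reference sets are discounted logarithmically.
    size_factor = 1.0 / (1.0 + math.log10(len(ref)))
    # Assumed type reduction: unknown types receive no penalty.
    type_factor = 1.0 - TYPE_PENALTY.get(reference_type, 0.0)
    return overlap * size_factor * type_factor


def best_label(input_values, references):
    """Return (category, score) for the highest-scoring reference set.

    `references` maps a category name to a (values, type) pair.
    """
    scored = {
        name: matching_score(input_values, values, rtype)
        for name, (values, rtype) in references.items()
    }
    label = max(scored, key=scored.get)
    return label, scored[label]


references = {
    "city": (["Paris", "London", "Berlin", "Rome"], "curated"),
    "name": (["Paris", "Alice", "Bob"], "uncurated"),
}
label, score = best_label(["Paris", "London", "Madrid"], references)
# label == "city": two of the three input values appear in the "city" set
```

Multiplying the overlap by factors no greater than 1 is only one way to "reduce" a value; subtracting a penalty would fit the same wording.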

19 Claims

1. A method comprising:
- receiving, by a cloud computing infrastructure system of a data enrichment system, an input data set from one or more input data sources, wherein the data enrichment system comprises a user experience layer configured to provide access to the data enrichment system and a scheduler service configured to manage requests and responses received through the user experience layer and configured to manage the cloud computing infrastructure system;
  
  comparing, by the cloud computing infrastructure system of the data enrichment system, the input data set to one or more reference data sets obtained from a reference source;
  
  computing, by the cloud computing infrastructure system, a similarity metric for each of the one or more reference data sets, the similarity metric indicating a measure of similarity of each of the one or more reference data sets in comparison to the input data set, wherein the similarity metric is a matching score computed for each of the one or more reference data sets with respect to the input data set, and wherein the similarity metric is computed as a value based on cardinality of an intersection of the one or more reference data sets in comparison to the input data set wherein the value is normalized by the cardinality, and wherein the value is reduced by a first factor based on a size of the one or more reference data sets, and the value is reduced by a second factor based on a type of the one or more reference data sets;
  
  identifying, by the cloud computing infrastructure system, a match between the input data set and the one or more reference data sets based on the similarity metric;
  
  generating, by an interactive visualization system of the cloud computing infrastructure system, an interactive graphical interface that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to visually identify the one or more reference data sets having a highest similarity metric with respect to the input data set; and
  
  rendering, using the interactive graphical interface, a graphical visualization that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to identify the matching one or more reference data sets in order to perform large scale data enrichment while reducing load on resources of the cloud computing infrastructure system.
  - 2. The method of claim 1, wherein the one or more reference data sets includes terms associated with a domain, and wherein the matching score is computed using one or more values including a first value indicating a metric about the one or more reference data sets and a second value indicating a metric based on comparing the input data set to the one or more reference data sets.
  - 3. The method of claim 2, wherein the graphical visualization is rendered to indicate the one or more values used to compute the matching score.
  - 4. The method of claim 2, wherein the one or more values includes a frequency value of terms matching between the input data set and the reference data set, a population value of the reference data set, unique matching value that indicates a number of different terms matching between the input data set and the reference data set, a domain value indicating a number of terms in the reference data set, and a curation level indicating a degree of curation of the reference data set.
  - 5. The method of claim 1, further comprising:
    - generating, by the cloud computing infrastructure system, an augmentation list based on augmentation data obtained from an aggregation service; and
      
      augmenting the input data set based on the augmentation list, wherein the input data compared to the one or more reference data sets is augmented based on the augmentation list.
  - 6. The method of claim 5, further comprising:
    - generating, by the cloud computing infrastructure system, an indexed trigram table based on the one or more reference data sets;
      
      for each word in the input data set after augmentation:
      
      creating trigrams for the word;
      
      comparing each of the trigrams to the indexed trigram table;
      
      identifying a word in the indexed trigram table associated with a trigram that matches a first trigram in the trigrams; and
      
      storing the word in a trigram augmented data set;
      
      comparing the trigram augmented data set to the one or more reference data sets;
      
      determining a match between the trigram augmented data set and the one or more reference data sets based on the comparing; and
      
      wherein identifying the match between the input data set and the one or more reference data sets is performed using the match between the trigram augmented data set and the one or more reference data sets based on the comparing.
  - 7. The method of claim 1, further comprising:
    - generating a data structure that represents at least a portion of the one or more reference data sets, wherein each node in the data structure represents a different character in one or more strings extracted from the one or more reference data sets; and
      
      wherein the input data set is compared to the one or more reference data sets by traversing the data structure.
  - 8. The method of claim 1, wherein the similarity metric is computed for each reference data set of the one or more reference data sets by determining a cosine similarity between the input data set and the reference data set.
  - 9. The method of claim 1, wherein identifying the match includes determining a reference data of the one or more reference data sets having a highest measure of similarity based on the similarity metric computed for each of the one or more reference data sets.
  - 10. The method of claim 1, wherein the input data set is formatted into one or more columns of data.
  - 11. The method of claim 1, wherein the reference source is a knowledge source provided by a knowledge service.
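
Claim 7 describes a data structure whose nodes each represent a character of strings drawn from the reference data sets, and a comparison performed by traversing it. A trie is one structure that fits that description; the claim does not name one, so the sketch below is an assumption, and `TrieNode`, `build_trie`, `contains`, and `count_matches` are hypothetical names.

```python
class TrieNode:
    """One node per character; `terminal` marks the end of a stored string."""

    def __init__(self):
        self.children = {}
        self.terminal = False


def build_trie(strings):
    """Build a character trie from strings extracted from a reference data set."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def contains(root, word):
    """Traverse the trie character by character; True only on an exact stored string."""
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.terminal


def count_matches(root, input_values):
    """Count how many input values are present in the reference trie."""
    return sum(1 for value in input_values if contains(root, value))


root = build_trie(["rome", "roma", "paris"])
matches = count_matches(root, ["rome", "paris", "berlin"])
```

Each lookup costs time proportional to the length of the input word rather than the number of reference strings, which is one plausible reason to compare by traversal.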

12. A data enrichment system comprising:
- a plurality of input data sources; and
  
  a cloud computing infrastructure system comprising:
  
  one or more processors communicatively coupled to the plurality of input data sources and communicatively coupled to a plurality of data targets, over at least one communication network; and
  
  a memory coupled to the one or more processors, the memory storing instructions to provide a data enrichment system, wherein the instructions, when executed by the one or more processors, cause the one or more processors to:
  
  receive an input data set from one or more of the plurality of input data sources;
  
  compare the input data set to one or more reference data sets obtained from a reference source;
  
  compute a similarity metric for each of the one or more reference data sets, the similarity metric indicating a measure of similarity of each of the one or more reference data sets in comparison to the input data set, wherein the similarity metric is a matching score computed for each of the one or more reference data sets with respect to the input data set, wherein the similarity metric is computed as a value based on cardinality of an intersection of the one or more reference data sets in comparison to the input data set, wherein the value is normalized by the cardinality, and wherein the value is reduced by a first factor based on a size of the one or more reference data sets, and the value is reduced by a second factor based on a type of the one or more reference data sets;
  
  identify a match between the input data set and the one or more reference data sets based on the similarity metric;
  
  generate, by an interactive visualization system of the cloud computing infrastructure system, an interactive graphical interface that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to visually identify the one or more reference data sets having a highest similarity metric with respect to the input data set; and
  
  render, by the interactive visualization system, a graphical visualization that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to identify the matching one or more reference data sets in order to perform large scale data enrichment, a user experience layer configured to provide access to the data enrichment system; and
  
  a scheduler service configured to manage requests and responses received through the user experience layer and configured to manage the cloud computing infrastructure system while reducing load on resources of the cloud computing infrastructure system.
  - 13. The data enrichment system of claim 12, wherein the one or more reference data sets includes terms associated with a domain, wherein the matching score is computed using one or more values including a first value indicating a metric about the one or more reference data sets and a second value indicating a metric based on comparing the input data set to the one or more reference data sets, and wherein the graphical visualization is rendered to indicate the one or more values used to compute the matching score.
  - 14. The data enrichment system of claim 13, wherein the one or more values includes a frequency value of terms matching between the input data set and the reference data set, a population value of the reference data set, unique matching value that indicates a number of different terms matching between the input data set and the reference data set, a domain value indicating a number of terms in the reference data set, and a curation level indicating a degree of curation of the reference data set.
  - 15. The data enrichment system of claim 12, wherein the instructions, when executed by the one or more processors, further cause the one or more processors to:
    - generate an augmentation list based on augmentation data obtained from an aggregation service;
      
      augment the input data set based on the augmentation list;
      
      generate an indexed trigram table based on the one or more reference data sets;
      
      for each word in the input data set after augmentation:
      
      create trigrams for the word;
      
      compare each of the trigrams to the indexed trigram table;
      
      identify a word in the indexed trigram table associated with a trigram that matches a first trigram in the trigrams; and
      
      store the word in a trigram augmented data set;
      
      compare the trigram augmented data set to the one or more reference data sets; and
      
      determine a match between the trigram augmented data set and the one or more reference data sets based on the comparing; and
      
      wherein the input data compared to the one or more reference data sets is augmented based on the augmentation list, and wherein identifying the match between the input data set and the one or more reference data sets is performed using the match between the trigram augmented data set and the one or more reference data sets based on the comparing.
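
The trigram steps recited in claims 6 and 15 can be sketched roughly as follows. This is an illustrative reading, not the claimed implementation: the function names are hypothetical, and choosing the reference word that shares the most trigrams with the input word (ties broken alphabetically) is an assumption, since the claims only require a word associated with a matching trigram.

```python
from collections import defaultdict


def trigrams(word):
    """Return the overlapping three-character substrings of a lowercased word."""
    w = word.lower()
    return [w[i:i + 3] for i in range(len(w) - 2)]


def build_trigram_index(reference_words):
    """Indexed trigram table: each trigram maps to the reference words containing it."""
    index = defaultdict(set)
    for word in reference_words:
        for tri in trigrams(word):
            index[tri].add(word)
    return index


def trigram_augment(input_words, index):
    """Map each input word to a reference word sharing trigrams with it, if any."""
    augmented = set()
    for word in input_words:
        votes = defaultdict(int)
        for tri in trigrams(word):
            # index.get avoids creating empty entries in the defaultdict.
            for candidate in index.get(tri, ()):
                votes[candidate] += 1
        if votes:
            # Assumed selection rule: most shared trigrams, alphabetical tie-break.
            augmented.add(max(sorted(votes), key=votes.get))
    return augmented


index = build_trigram_index(["California", "Colorado", "Florida"])
augmented = trigram_augment(["Califronia", "Flordia"], index)
# augmented == {"California", "Florida"}
```

The augmented set can then be compared to the reference data sets in place of the misspelled input values.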

16. A method comprising:
- receiving an input data set from one or more input data sources;
  
  comparing, by a cloud computing infrastructure system of a data enrichment system, the input data set to one or more reference data sets obtained from a reference source, wherein the data enrichment system comprises a user experience layer configured to provide access to the data enrichment system and a scheduler service configured to manage requests and responses received through the user experience layer and configured to manage the cloud computing infrastructure system;
  
  computing, by the cloud computing infrastructure system, a similarity metric for each of the one or more reference data sets, the similarity metric indicating a measure of similarity of each of the one or more reference data sets in comparison to the input data set, wherein the similarity metric is a matching score computed for each of the one or more reference data sets with respect to the input data set wherein the similarity metric is computed as a value based on cardinality of an intersection of the one or more reference data sets in comparison to the input data set, wherein the value is normalized by the cardinality, and wherein the value is reduced by a first factor based on a size of the one or more reference data sets and the value is reduced by a second factor based on a type of the one or more reference data sets;
  
  identifying, by an interactive visualization system of the cloud computing infrastructure system, a match between the input data set and the one or more reference data sets based on the similarity metric in order to identify the one or more reference data sets having a highest similarity metric with respect to the input data set; and
  
  storing the input data set with matching information that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to perform large scale data enrichment while reducing load on resources of the cloud computing infrastructure system.
  - 17. The method of claim 16, further comprising:
    - identifying a category label for the input data set based on identifying the match between the input data set and the one or more reference data sets; and
      
      storing the input data set in association with the category label.
  - 18. The method of claim 16, wherein the similarity metric is computed using one or more of a Jaccard Index, a Tversky Index, or a Dice-Sorensen Index.
  - 19. The method of claim 16, wherein the input data set is compared to the one or more reference data sets using one or more of graph matching or semantic similarity matching.
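
For reference, the three set-similarity indices named in claim 18 have standard definitions, sketched below in Python. The function names and the default `alpha` and `beta` weights are illustrative choices; the claim does not say how the indices are parameterized or combined.

```python
def jaccard(a, b):
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """Dice-Sorensen index: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index: |A ∩ B| / (|A ∩ B| + alpha*|A - B| + beta*|B - A|)."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


input_set = {"red", "green", "blue"}
reference_set = {"green", "blue", "black"}
j = jaccard(input_set, reference_set)        # 2 / 4
d = dice(input_set, reference_set)           # 4 / 6
t = tversky(input_set, reference_set, 1, 1)  # equals the Jaccard index
```

The Tversky index generalizes the other two: `alpha = beta = 1` reproduces Jaccard and `alpha = beta = 0.5` reproduces Dice-Sorensen, while unequal weights make the measure asymmetric between the input and reference sets.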

Current Assignee: Oracle International Corporation (Oracle Corporation)
Original Assignee: Oracle International Corporation (Oracle Corporation)
Inventors: Stojanovic, Alexander Sasha; Kreider, Mark; Malak, Michael; Murray, Glenn Allen
Primary Examiner(s): Beausoliel, Jr., Robert W
Assistant Examiner(s): Khakhar, Nirav K
Application Number: US14/864,485
Publication Number: US 20160092557A1
Time in Patent Office: 1,244 Days

CPC Class Codes
- G06F 16/215: Improving data quality; Dat...
- G06F 16/248: Presentation of query results
- G06F 16/254: Extract, transform and load...
- G06F 16/285: Clustering or classification
- G06F 16/334: Query execution G06F16/335 ...
- G06F 16/35: Clustering; Classification
- G06F 16/355: Class or cluster creation o...
- G06F 16/9024: Graphs; Linked lists G06F16...
- G06Q 30/02: Marketing; Price estimation...
- G06Q 30/0631: Item recommendations
