Techniques for similarity analysis and data enrichment using knowledge sources
Abstract
The present disclosure relates to performing similarity metric analysis and data enrichment using knowledge sources. A data enrichment service can compare an input data set to reference data sets stored in a knowledge source to identify similarly related data. A similarity metric can be calculated corresponding to the semantic similarity of two or more datasets. The similarity metric can be used to identify datasets based on their metadata attributes and data values, enabling easier indexing and high-performance retrieval of data values. An input data set can be labeled with a category based on the data set having the best match with the input data set. The similarity of an input data set with a data set provided by a knowledge source can be used to query the knowledge source to obtain additional information about the data set. The additional information can be used to provide recommendations to the user.
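The labeling step described in the abstract can be illustrated with a minimal Python sketch. It assumes the similarity scores have already been computed and are keyed by the category name of each reference data set. The function name `label_input` and the `min_score` cutoff are hypothetical; the abstract does not mention a threshold, and it is included here only so that weak matches do not produce a label.

```python
def label_input(scores, min_score=0.5):
    """Return the category of the best-matching reference data set, or None.

    scores: dict mapping a reference category name to its similarity score.
    min_score: hypothetical cutoff (an assumption, not from the source).
    """
    if not scores:
        return None
    # Pick the reference data set with the highest similarity score.
    category, score = max(scores.items(), key=lambda kv: kv[1])
    return category if score >= min_score else None


if __name__ == "__main__":
    print(label_input({"city": 0.74, "color": 0.0}))
```

With the sample scores above, the input data set would be labeled with the "city" category, since that reference data set has the highest score and clears the assumed cutoff.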
19 Claims
1. A method comprising:
receiving, by a cloud computing infrastructure system of a data enrichment system, an input data set from one or more input data sources, wherein the data enrichment system comprises a user experience layer configured to provide access to the data enrichment system and a scheduler service configured to manage requests and responses received through the user experience layer and configured to manage the cloud computing infrastructure system;
comparing, by the cloud computing infrastructure system of the data enrichment system, the input data set to one or more reference data sets obtained from a reference source;
computing, by the cloud computing infrastructure system, a similarity metric for each of the one or more reference data sets, the similarity metric indicating a measure of similarity of each of the one or more reference data sets in comparison to the input data set, wherein the similarity metric is a matching score computed for each of the one or more reference data sets with respect to the input data set, and wherein the similarity metric is computed as a value based on cardinality of an intersection of the one or more reference data sets in comparison to the input data set, wherein the value is normalized by the cardinality, and wherein the value is reduced by a first factor based on a size of the one or more reference data sets, and the value is reduced by a second factor based on a type of the one or more reference data sets;
identifying, by the cloud computing infrastructure system, a match between the input data set and the one or more reference data sets based on the similarity metric;
generating, by an interactive visualization system of the cloud computing infrastructure system, an interactive graphical interface that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to visually identify the one or more reference data sets having a highest similarity metric with respect to the input data set; and
rendering, using the interactive graphical interface, a graphical visualization that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to identify the matching one or more reference data sets in order to perform large scale data enrichment while reducing load on resources of the cloud computing infrastructure system.
View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
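One possible reading of the similarity metric recited in claim 1 is sketched below in Python. The claim states only that the value is based on the cardinality of the intersection, is normalized by "the cardinality", and is reduced by a size-based factor and a type-based factor. Everything else here is an assumption made for illustration: normalizing by the number of distinct input values, using a logarithmic size penalty scaled by `size_weight`, and the `TYPE_PENALTY` table with its type names and values.

```python
import math

# Hypothetical per-type penalties; the claim says only that the second
# factor depends on the type of the reference data set.
TYPE_PENALTY = {"curated": 0.0, "crowdsourced": 0.05}


def similarity(input_values, reference_values, reference_type, size_weight=0.01):
    """Score one reference data set against the input data set."""
    inp = set(input_values)
    ref = set(reference_values)
    if not inp or not ref:
        return 0.0
    # Cardinality of the intersection, normalized by the number of
    # distinct input values (assumed choice of normalizer).
    base = len(inp & ref) / len(inp)
    # First factor: grows with the size of the reference data set
    # (assumed logarithmic form).
    size_factor = size_weight * math.log10(len(ref))
    # Second factor: depends on the type of the reference data set.
    type_factor = TYPE_PENALTY.get(reference_type, 0.0)
    return max(0.0, base - size_factor - type_factor)


if __name__ == "__main__":
    cities = ["Paris", "London", "Berlin", "Rome", "Madrid",
              "Vienna", "Oslo", "Lisbon", "Dublin", "Prague"]
    column = ["Paris", "London", "Berlin", "Tokyo"]
    print(similarity(column, cities, "curated"))
```

The size penalty is meant to keep a very large reference data set from winning simply because it contains almost everything; whether the patent's specification uses this particular form is not established by the claim text.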
12. A data enrichment system comprising:
a plurality of input data sources; and
a cloud computing infrastructure system comprising:
one or more processors communicatively coupled to the plurality of input data sources and communicatively coupled to a plurality of data targets, over at least one communication network; and
a memory coupled to the one or more processors, the memory storing instructions to provide a data enrichment system, wherein the instructions, when executed by the one or more processors, cause the one or more processors to:
receive an input data set from one or more of the plurality of input data sources;
compare the input data set to one or more reference data sets obtained from a reference source;
compute a similarity metric for each of the one or more reference data sets, the similarity metric indicating a measure of similarity of each of the one or more reference data sets in comparison to the input data set, wherein the similarity metric is a matching score computed for each of the one or more reference data sets with respect to the input data set, wherein the similarity metric is computed as a value based on cardinality of an intersection of the one or more reference data sets in comparison to the input data set, wherein the value is normalized by the cardinality, and wherein the value is reduced by a first factor based on a size of the one or more reference data sets, and the value is reduced by a second factor based on a type of the one or more reference data sets;
identify a match between the input data set and the one or more reference data sets based on the similarity metric;
generate, by an interactive visualization system of the cloud computing infrastructure system, an interactive graphical interface that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to visually identify the one or more reference data sets having a highest similarity metric with respect to the input data set; and
render, by the interactive visualization system, a graphical visualization that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to identify the matching one or more reference data sets in order to perform large scale data enrichment,
a user experience layer configured to provide access to the data enrichment system; and
a scheduler service configured to manage requests and responses received through the user experience layer and configured to manage the cloud computing infrastructure system while reducing load on resources of the cloud computing infrastructure system.
View Dependent Claims (13, 14, 15)
16. A method comprising:
receiving an input data set from one or more input data sources;
comparing, by a cloud computing infrastructure system of a data enrichment system, the input data set to one or more reference data sets obtained from a reference source, wherein the data enrichment system comprises a user experience layer configured to provide access to the data enrichment system and a scheduler service configured to manage requests and responses received through the user experience layer and configured to manage the cloud computing infrastructure system;
computing, by the cloud computing infrastructure system, a similarity metric for each of the one or more reference data sets, the similarity metric indicating a measure of similarity of each of the one or more reference data sets in comparison to the input data set, wherein the similarity metric is a matching score computed for each of the one or more reference data sets with respect to the input data set, wherein the similarity metric is computed as a value based on cardinality of an intersection of the one or more reference data sets in comparison to the input data set, wherein the value is normalized by the cardinality, and wherein the value is reduced by a first factor based on a size of the one or more reference data sets and the value is reduced by a second factor based on a type of the one or more reference data sets;
identifying, by an interactive visualization system of the cloud computing infrastructure system, a match between the input data set and the one or more reference data sets based on the similarity metric in order to identify the one or more reference data sets having a highest similarity metric with respect to the input data set; and
storing the input data set with matching information that indicates the similarity metric computed for each of the one or more reference data sets and that indicates the match identified between the input data set and the one or more reference data sets in order to perform large scale data enrichment while reducing load on resources of the cloud computing infrastructure system.
View Dependent Claims (17, 18, 19)
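The storing step of claim 16 keeps the input data set together with matching information: the similarity metric for each reference data set and the identified match. A minimal Python sketch of such a record follows. The `MatchRecord` class, its field names, and the JSON serialization are hypothetical choices for illustration; the claim does not specify a storage format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class MatchRecord:
    """Input data set stored alongside its matching information."""
    input_values: List[str]
    scores: Dict[str, float]            # reference name -> similarity metric
    matched_reference: Optional[str]    # reference with the highest metric


def build_record(input_values, scores):
    # The match is taken to be the reference data set with the highest
    # similarity metric; None if no reference data sets were scored.
    matched = max(scores, key=scores.get) if scores else None
    return MatchRecord(list(input_values), dict(scores), matched)


def to_json(record):
    # Serialize for storage; sort_keys gives a stable representation.
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    record = build_record(["Paris", "London"], {"cities": 0.74, "colors": 0.0})
    print(to_json(record))
```

Storing the scores with the data means a later enrichment pass can reuse the identified match instead of recomputing the comparison, which is one way to read the claim's goal of reducing load on the infrastructure.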
Specification