Knowledge catalysts
Abstract
A computer implemented method integrates data from remote disparate data sources by processing a non-transitory media. The non-transitory media stores instructions for detecting data sets in different formats hosted in a plurality of heterogeneous databases that are accessible through a distributed network. The method extracts schema data from the plurality of heterogeneous databases and identifies related fields in two or more of the heterogeneous databases. The method links the related fields in the two or more of the plurality of heterogeneous databases and makes the data accessible through a virtual warehouse. As schemas change, as new data sources and analysis artifacts are created, the computer implemented method and system can act as a meta-data store, a provenance tracking device, and/or a knowledge management service.
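The schema-extraction step described in the abstract can be illustrated with a minimal sketch. It assumes two in-memory SQLite databases stand in for the remote heterogeneous sources. The names `extract_schema` and `build_demo_sources`, and the `crm` and `billing` tables, are hypothetical and do not come from the patent. The resulting `catalog` dictionary plays the role of a very simple metadata store.

```python
import sqlite3
from typing import Dict, List, Tuple


def extract_schema(conn: sqlite3.Connection) -> Dict[str, List[Tuple[str, str]]]:
    """Return {table_name: [(column_name, declared_type), ...]} for one source."""
    schema: Dict[str, List[Tuple[str, str]]] = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


def build_demo_sources() -> Dict[str, sqlite3.Connection]:
    """Hypothetical stand-ins for two independent data sources."""
    crm = sqlite3.connect(":memory:")
    crm.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
    billing = sqlite3.connect(":memory:")
    billing.execute("CREATE TABLE invoices (cust_id INTEGER, amount REAL)")
    return {"crm": crm, "billing": billing}


# One catalog entry per source; later steps would search it for related fields.
catalog = {name: extract_schema(conn) for name, conn in build_demo_sources().items()}
```

A real deployment would query each database engine's own catalog views instead of SQLite's, but the output shape (source, table, column, type) would be similar.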
20 Claims
1. A computer implemented method of integrating data from remote disparate data sources comprising a non-transitory media, comprising programming for:
detecting data sets in different formats having a plurality of fields hosted in a plurality of remote heterogeneous databases accessible through infrastructures that are coupled through a distributed network;
extracting schema data of the plurality of remote heterogeneous databases;
modeling each position of selected plurality of fields of the plurality of remote heterogeneous databases as a plurality of polynomials, identifying related fields in two or more of the plurality of remote heterogeneous databases by automatically hypothesizing data links based on column features that identify the number of distinct data items in each column in the plurality of remote heterogeneous databases and fuzzy logic matching that compares divergence of the plurality of polynomials; and
linking the related fields automatically in the two or more of the plurality of remote heterogeneous databases through a virtual warehouse.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
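One possible reading of the polynomial modeling and divergence comparison in claim 1 is sketched below. The claim does not say how the polynomials are built, so everything here is an assumption: each numeric column is summarized by a least-squares polynomial through its sorted values (positions rescaled to [0, 1]), divergence is the mean absolute gap between two such polynomials, and a graded score between 0 and 1 stands in for the fuzzy logic matching. The function names are hypothetical.

```python
from typing import List, Sequence


def fit_polynomial(values: Sequence[float], degree: int = 3) -> List[float]:
    """Least-squares coefficients (lowest power first) for the sorted values."""
    ys = sorted(float(v) for v in values)
    n, m = len(ys), degree + 1
    if n < max(m, 2):
        raise ValueError("need at least degree + 1 (and at least 2) values")
    xs = [i / (n - 1) for i in range(n)]  # positions rescaled to [0, 1]
    # Normal equations A c = b of the least-squares problem.
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            factor = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(A[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / A[i][i]
    return coeffs


def evaluate(coeffs: Sequence[float], x: float) -> float:
    return sum(c * x ** k for k, c in enumerate(coeffs))


def divergence(p: Sequence[float], q: Sequence[float], grid: int = 50) -> float:
    """Mean absolute difference of two polynomials on an even grid over [0, 1]."""
    points = [i / (grid - 1) for i in range(grid)]
    return sum(abs(evaluate(p, x) - evaluate(q, x)) for x in points) / grid


def link_score(col_a: Sequence[float], col_b: Sequence[float]) -> float:
    """Graded score in (0, 1]: similar distinct counts and low divergence score high."""
    div = divergence(fit_polynomial(col_a), fit_polynomial(col_b))
    distinct_a, distinct_b = len(set(col_a)), len(set(col_b))
    distinct_ratio = min(distinct_a, distinct_b) / max(distinct_a, distinct_b)
    return distinct_ratio / (1.0 + div)
```

Because the divergence is not normalized, columns on very different scales score near zero even when their shapes agree; a fuller sketch would rescale values before fitting.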
12. A unified data integration system, comprising:
one or more processors;
a memory; and
one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs including instructions for automatically;
detecting data sets hosted in a plurality of heterogeneous databases that are coupled through a distributed network;
extracting schema and relationship data from the plurality heterogeneous databases;
generating polynomial models and histogram distributions of a random sample of data sampled from the plurality of heterogeneous databases;
generating schema-level hypotheses that makes hypothetical content connections between previously unknown data sources accessed from at least two of the plurality heterogeneous databases to build a virtual schema based on the hypothetical content connections, an identified number of distinct data values in the plurality heterogeneous databases and fuzzy logic string matching that compares divergence between the polynomial models;
automatically generating schema crosswalk keys that identify equivalent elements; and
validating at least some of the schema-level hypotheses automatically.
Dependent claims: 13, 14, 15, 16, 17
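The histogram distributions of a random sample mentioned in claim 12 could look roughly like the sketch below. It assumes numeric columns, a bin range shared by both columns, and a simple overlap measure for comparing two histograms; none of these choices are specified by the claim, and `sample_histogram` and `histogram_overlap` are hypothetical names.

```python
import random
from typing import List, Sequence


def sample_histogram(values: Sequence[float], lo: float, hi: float,
                     bins: int = 10, sample_size: int = 100,
                     seed: int = 0) -> List[float]:
    """Normalized histogram of a random sample drawn from one column."""
    if not values or hi <= lo or bins < 1:
        raise ValueError("need values, hi > lo, and at least one bin")
    pool = list(values)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = pool if len(pool) <= sample_size else rng.sample(pool, sample_size)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in sample:
        idx = int((v - lo) / width)
        counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
    return [count / len(sample) for count in counts]


def histogram_overlap(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Shared probability mass: about 1.0 for identical shapes, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

Sampling keeps the cost bounded on large tables, at the price of some noise in the bin fractions; both histograms must use the same `lo`, `hi`, and `bins` for the overlap to be meaningful.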
18. A method for analyzing a large data set from disparate sources using a computer, the method comprising:
building a virtual schema of the data set in a memory of the computer automatically;
modeling selected fields of a plurality of remote heterogeneous databases as polynomials;
linking data elements automatically by hypothesizing links based between data elements based on naming conventions of remote data elements, identifying the number of remote distinct data elements, and fuzzy logic means that compares divergence between the polynomials; and
displaying the data set visually on a monitor of the computer as an interconnected plurality of local databases and tables automatically.
Dependent claims: 19, 20
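Claim 18 also hypothesizes links from the naming conventions of data elements. A minimal sketch of that idea, assuming approximate string similarity from Python's standard `difflib` module as a stand-in for whatever matching the patent actually uses, is shown below. The threshold of 0.7 and the function names are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def normalize(name: str) -> str:
    """Lowercase and drop separators so 'Customer_ID' and 'customer-id' agree."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def hypothesize_links(cols_a: Sequence[str], cols_b: Sequence[str],
                      threshold: float = 0.7) -> List[Tuple[str, str, float]]:
    """Candidate (column_a, column_b, score) pairs, best score first."""
    pairs = []
    for a in cols_a:
        for b in cols_b:
            score = name_similarity(a, b)
            if score >= threshold:
                pairs.append((a, b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical column names from two sources.
links = hypothesize_links(["cust_id", "order_date"], ["customer_id", "amount"])
```

Name similarity alone produces false positives, which is presumably why the claim combines it with distinct-value counts and polynomial divergence before a link is accepted.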
Specification