Data mining platform for bioinformatics and other knowledge discovery
Abstract
The data mining platform comprises a plurality of system modules, each formed from a plurality of components. Each module has an input data component, a data analysis engine for processing the input data, an output data component for outputting the results of the data analysis, and a web server to access and monitor the other modules within the unit and to provide communication to other units. Each module processes a different type of data, for example, a first module processes microarray (gene expression) data while a second module processes biomedical literature on the Internet for information supporting relationships between genes and diseases and gene functionality. In the preferred embodiment, the data analysis engine is a kernel-based learning machine, and in particular, one or more support vector machines (SVMs). The data analysis engine includes a pre-processing function for feature selection, for reducing the amount of data to be processed by selecting the optimum number of attributes, or “features”, relevant to the information to be discovered.
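The module layout described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the names Module, combine and run_platform, the toy analyzers, and the rule of summing scores across modules are all assumptions made for the example, not details taken from the patent. The web server layer is left out, and plain functions stand in for the SVM-based analysis engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: nothing here is an identifier from the patent.
Scores = Dict[str, float]  # gene or protein name -> relevance score


@dataclass
class Module:
    """One module: an input data component, an analysis engine, an output."""
    data_type: str
    load_input: Callable[[], object]      # input data component
    analyze: Callable[[object], Scores]   # analysis engine (an SVM in the patent)

    def run(self) -> Scores:
        # The returned scores play the role of the output data component.
        return self.analyze(self.load_input())


def combine(outputs: List[Scores]) -> List[str]:
    """Assumed combination rule: sum each name's scores, then rank."""
    total: Scores = {}
    for scores in outputs:
        for name, score in scores.items():
            total[name] = total.get(name, 0.0) + score
    # Highest combined score first; ties broken alphabetically.
    return sorted(total, key=lambda n: (-total[n], n))


def run_platform(modules: List[Module]) -> List[str]:
    """Run every module and merge the per-module outputs into one ranking."""
    return combine([m.run() for m in modules])


# Toy stand-ins for two modules that handle different data types.
expression_module = Module(
    data_type="gene expression",
    load_input=lambda: {"GENE_A": 2.0, "GENE_B": 0.5, "GENE_C": 1.0},
    analyze=lambda data: dict(data),
)
literature_module = Module(
    data_type="biomedical literature",
    load_input=lambda: ["GENE_B", "GENE_B", "GENE_B", "GENE_C"],
    analyze=lambda mentions: {g: float(mentions.count(g)) for g in set(mentions)},
)

ranked = run_platform([expression_module, literature_module])
print(ranked)  # ['GENE_B', 'GENE_A', 'GENE_C']
```

Keeping each module behind the same small interface means a module for another data type can be added without touching the combination step, which is the property the abstract appears to rely on.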
22 Claims
1. A data mining platform for generating an output comprising knowledge from analysis of a plurality of biological data sets, wherein the data sets include heterogeneous data types or the data sets come from heterogeneous data sources, the platform comprising:
- a computer system programmed to implement a plurality of modules stored within a system memory, each module configured for processing one data type of the plurality of heterogeneous data types, each module comprising an input data source, a data analysis engine, a data output and a web server connection for each of the input data source, the data analysis engine and the data output, wherein the computer system comprises at least one processor for executing as part of the data analysis engine of each module one or more support vector machines for generating a plurality of classes of data and at least one margin between classes;
- a web server connected to the web server connection of each module for communicating with each of the input data source, the data analysis engine and the data output of the corresponding module and for providing means for monitoring one or more of the input data source, the data analysis engine, and the data output;
- a combined data analysis engine in communication with the web server for combining the data output from the plurality of modules to generate a single output representing results obtained from analyzing the plurality of heterogeneous data types; and
- a graphical user interface for receiving the results and generating at a printer or display device a report of organized results;
- wherein the at least one processor executes multiple iterations of a feature subset ranking algorithm on a plurality of data sets comprising one or more of sub-samples of the same data set, multiple data sets of heterogeneous data types, and heterogeneous data sources, to produce ranked lists of feature subsets;
- wherein the heterogeneous data types comprise one or more data types selected from the group consisting of gene expression data, 2-D gel data, mass spectrometry data, antibody screening data, clinical observations, clinical history, physical and chemical measurements, genomic determinations, proteomic determinations, drug levels, hormonal and immunological tests, neurochemical or neurophysical measurements, mineral and vitamin level determinations, and genetic and familial histories, and wherein the heterogeneous data sources comprises one or more data sources selected from the group consisting of sensor instruments for collection of genomic data, sensor instruments for collection of proteomic data, sensor instruments for collection of physical and chemical measurements, clinical record databases, general internet search engines, on-line genetic databases, on-line proteomic databases, and on-line journals; and
- wherein the ranked lists of feature subsets comprise lists of genes or proteins.
12. A computer program product embodied on a computer readable medium for discovering knowledge from analysis of a plurality of biological data sets, wherein the data sets include heterogeneous data types or the data sets come from heterogeneous data sources, the computer program product comprising instructions for executing support vector machine classifiers and further for causing a computer processor to:
- receive data from the plurality of biological data sets and;
- (a) implement a plurality of modules stored within a system memory, each module configured for processing one data type of the plurality of heterogeneous data types, each module comprising an input data source, a data analysis engine, a data output and a web server connection for each of the input data source, the data analysis engine and the data output, wherein the computer system comprises at least one processor for executing as part of the data analysis engine of each module one or more support vector machines for generating a plurality of classes of data and at least one margin between classes;
- (b) connect between a web server and the web server connection of each module for communicating with each of the input data source, the data analysis engine and the data output of the corresponding module and for providing means for monitoring one or more of the input data source, the data analysis engine, and the data output;
- (c) combine the data output from the plurality of modules to generate a single output representing results obtained from analyzing the plurality of heterogeneous data types; and
- (d) generate a display of organized analysis results;
- wherein the at least one processor executes multiple iterations of a feature subset ranking algorithm on a plurality of data sets comprising one or more of sub-samples of the same data set, multiple data sets of heterogeneous data types, and heterogeneous data sources, to produce ranked lists of feature subsets;
- wherein the heterogeneous data types comprise one or more data types selected from the group consisting of gene expression data, mass spectrometry data, 2-D gel data, antibody screening, clinical observations, clinical history, physical and chemical measurements, genomic determinations, proteomic determinations, drug levels, hormonal and immunological tests, neurochemical or neurophysical measurements, mineral and vitamin level determinations, and genetic and familial histories, and wherein the heterogeneous data sources comprises one or more data sources selected from the group consisting of sensor instruments for collection of genomic data, sensor instruments for collection of proteomic data, sensor instruments for collection of physical and chemical measurements, clinical record databases, general internet search engines, on-line genetic databases, on-line proteomic databases, and on-line journals; and
- wherein the ranked lists of feature subsets comprise lists of genes or proteins.
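The clause in the claims above on running multiple iterations of a feature ranking algorithm over sub-samples of the same data set can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names, the toy data, the stratified sub-sampling, and above all the scoring rule, which is a simple class-separation ratio swapped in for the SVM-based ranking the claims refer to. Per-run rankings are merged by summing rank positions, which is one possible aggregation among many.

```python
import random
from statistics import mean, pstdev

# Hypothetical helpers; the scoring rule is a stand-in, not the patent's method.


def score_feature(values, labels):
    """Class-separation score: |mean(+) - mean(-)| / (std(+) + std(-))."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    return abs(mean(pos) - mean(neg)) / (spread + 1e-9)


def rank_once(X, labels, names):
    """Rank feature names on one data set, best-separating feature first."""
    scores = {
        name: score_feature([row[j] for row in X], labels)
        for j, name in enumerate(names)
    }
    return sorted(names, key=lambda n: (-scores[n], n))


def rank_over_subsamples(X, labels, names, iterations=20, fraction=0.75, seed=0):
    """Repeat the ranking on random sub-samples and merge by summed rank."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    rank_sum = {n: 0 for n in names}
    for _ in range(iterations):
        # Sample each class separately so both classes are always present.
        keep = (rng.sample(pos_idx, max(2, int(fraction * len(pos_idx))))
                + rng.sample(neg_idx, max(2, int(fraction * len(neg_idx)))))
        sub_X = [X[i] for i in keep]
        sub_y = [labels[i] for i in keep]
        for position, name in enumerate(rank_once(sub_X, sub_y, names)):
            rank_sum[name] += position
    # Lowest summed rank first: features that rank well in most runs win.
    return sorted(names, key=lambda n: (rank_sum[n], n))


# Toy expression matrix: rows are samples, columns are the three genes.
names = ["GENE_A", "GENE_B", "GENE_C"]
X = [
    [5.0, 3.0, 1.0], [5.2, 3.4, 3.0], [4.8, 2.8, 2.0], [5.1, 3.2, 4.0],  # class 1
    [1.0, 1.5, 1.1], [1.2, 2.1, 2.9], [0.9, 1.3, 2.1], [1.1, 1.9, 3.9],  # class 0
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

consensus = rank_over_subsamples(X, labels, names)
print(consensus)  # ['GENE_A', 'GENE_B', 'GENE_C']
```

In this toy data GENE_A separates the classes strongly, GENE_B moderately, and GENE_C not at all, so every sub-sample yields the same order. On real data the per-run rankings would differ, and the aggregation step is what makes the final list less sensitive to any single sub-sample.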
Specification