Method for enhancing knowledge discovered from biological data using a learning machine
Abstract
A learning machine is used to extract useful information from vast quantities of biological data. The method includes pre-processing of training data and test data to add dimensionality or to identify missing or erroneous data points. The training data is used to train the learning machine, after which the success of the training is tested using the test data. The test output is post-processed to determine whether the knowledge discovered from the pre-processed test data set is desirable. After the training has been confirmed, live biological data can be pre-processed and then input into the trained learning machine for extraction of useful information. In the preferred embodiment, the learning machine is one or more support vector machines.
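The workflow summarized in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: a simple perceptron stands in for the learning machine (the preferred embodiment uses support vector machines), and the added coordinate `x * x`, the toy data, and the 0.9 accuracy threshold used to decide whether the result is "desirable" are all hypothetical choices made for the example.

```python
def expand(x):
    """Pre-process one data point by adding x*x as a new coordinate
    (an assumed transformation, chosen only for this toy problem)."""
    return (x, x * x)


def train_perceptron(points, labels, max_epochs=100):
    """Stand-in learning machine: classic perceptron with a bias term."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for f, y in zip(points, labels):
            score = sum(wi * fi for wi, fi in zip(w, f)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + y * fi for wi, fi in zip(w, f)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return w, b


def predict(model, f):
    w, b = model
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else -1


def post_process(outputs, expected, min_accuracy=0.9):
    """Post-process the test output: compare it with the known test labels
    and decide whether the result is acceptable (threshold is an assumption)."""
    correct = sum(1 for o, e in zip(outputs, expected) if o == e)
    accuracy = correct / len(expected)
    return accuracy, accuracy >= min_accuracy


# Toy data: the class is +1 when |x| > 1, otherwise -1.
train_x = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
train_y = [1 if abs(x) > 1 else -1 for x in train_x]
test_x = [-2.5, -0.25, 0.25, 2.5]
test_y = [1 if abs(x) > 1 else -1 for x in test_x]

model = train_perceptron([expand(x) for x in train_x], train_y)
test_output = [predict(model, expand(x)) for x in test_x]
accuracy, desirable = post_process(test_output, test_y)
```

In this toy problem the two classes cannot be separated by a single threshold on `x` alone, which is why each point is expanded before training and why the test points receive the same expansion.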
51 Claims
1. A method for enhancing knowledge discovered from biological data using a learning machine comprising the steps of:
a) pre-processing a training data set derived from biological data to expand each of a plurality of training data points;
training the learning machine using the pre-processed training data set;
pre-processing a test data set derived from biological data to expand each of a plurality of test data points;
testing the trained learning machine using the pre-processed test data set to generate a test output; and
in response to receiving the test output of the trained learning machine, post-processing the test output to determine if the knowledge discovered from the pre-processed test data set is desirable.
wherein adding dimensionality to each training data point comprises adding one or more new coordinates to the vector.
4. The method of claim 3, wherein the new coordinate added to the vector is derived by applying a transformation to one of the original coordinates.
5. The method of claim 4, wherein the transformation is based on expert knowledge.
6. The method of claim 4, wherein the transformation is computationally derived.
7. The method of claim 4, wherein the training data set comprises a continuous variable; and
wherein the transformation comprises optimally categorizing the continuous variable of the training data set.
8. The method of claim 1, wherein post-processing the test output comprises interpreting the test output into a format that may be compared with the plurality of test data points.
9. The method of claim 1, wherein the knowledge to be discovered from the data relates to a regression or density estimation; and
wherein post-processing the test output comprises optimally categorizing the test output to derive cutoff points in the continuous variable.
10. The method of claim 1, wherein the knowledge to be discovered from the data relates to a regression or density estimation;
wherein the training output comprises a continuous variable; and
wherein the method further comprises the steps of:
in response to training the learning machine, receiving a training output from the learning machine, and post-processing the training output by optimally categorizing the test output to derive cutoff points in the continuous variable.
11. The method of claim 1, wherein the knowledge discovered from biological data comprises diagnosis or prognosis of a disease state.
12. The method of claim 1, wherein the knowledge discovered from biological data comprises efficacy of treatment of a disease state.
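Claims 4 to 6 describe new coordinates derived by transforming an original coordinate, with the transformation either based on expert knowledge or computationally derived. The following Python sketch shows one way this might look. The specific transformations are assumptions made for illustration: a base-2 logarithm stands in for an expert-chosen transformation, and a z-score computed from the training set stands in for a computationally derived one. The helper names are hypothetical.

```python
import math


def log2_of_first(point):
    """Stand-in for an expert-chosen transformation: log2 of coordinate 0."""
    return math.log2(point[0])


def make_zscore(points, index):
    """Build a computationally derived transformation from the training set:
    the z-score of coordinate `index` (population standard deviation)."""
    values = [p[index] for p in points]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        std = 1.0  # constant coordinate: avoid dividing by zero
    return lambda point: (point[index] - mean) / std


def add_coordinates(point, transforms):
    """Return the original vector followed by one new coordinate per transform."""
    return list(point) + [t(point) for t in transforms]


training = [[1.0, 1.0], [8.0, 3.0]]
transforms = [log2_of_first, make_zscore(training, 1)]
expanded = [add_coordinates(p, transforms) for p in training]
```

The z-score parameters are computed once from the training set; applying the same `transforms` list to test points keeps both data sets in the same expanded space.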
13. A method for enhancing knowledge discovered from biological data using a support vector machine comprising the steps of:
pre-processing a training data set derived from biological data to add meaning to each of a plurality of training data points;
training the support vector machine using the pre-processed training data set;
pre-processing a test data set derived from biological data to expand each of a plurality of test data points;
testing the trained support vector machine using the pre-processed test data set to generate a test output; and
in response to receiving the test output of the trained support vector machine, post-processing the test output to determine if the test output is an optimal solution.
wherein pre-processing the training data set to add meaning to each training data point comprises:
determining that the training data point is dirty; and
in response to determining that the training data point is dirty, cleaning the training data point.
15. The method of claim 14, wherein cleaning the training data point comprises deleting, repairing or replacing the data point.
16. The method of claim 13, wherein each training data point comprises a vector having one or more original coordinates; and
wherein pre-processing the training data set to add meaning to each training data point comprises adding dimensionality to each training data point by adding one or more new coordinates to the vector.
17. The method of claim 16, wherein the one or more new coordinates added to the vector are derived by applying a transformation to one or more of the original coordinates.
18. The method of claim 17, wherein the transformation is based on expert knowledge.
19. The method of claim 17, wherein the transformation is computationally derived.
20. The method of claim 17, wherein the training data set comprises a continuous variable; and
wherein the transformation comprises optimally categorizing the continuous variable of the training data set.
21. The method of claim 13, wherein post-processing the test output comprises interpreting the test output into a format that may be compared with the test data set.
22. The method of claim 13, wherein the knowledge to be discovered from the data relates to a regression or density estimation;
wherein a training output comprises a continuous variable; and
wherein the method further comprises the step of post-processing the training output by optimally categorizing the training output to derive cutoff points in the continuous variable.
23. The method of claim 13, further comprising the steps of:
selecting a kernel for the support vector machine prior to training the support vector machine;
in response to post-processing the test output, determining that the test output is not the optimal solution;
adjusting the selection of the kernel; and
in response to adjusting the selection of the kernel, retraining and retesting the support vector machine.
24. The method of claim 23, wherein the selection of a kernel is based on prior performance or historical data and is dependent on the nature of the knowledge to be discovered from the data or the nature of the data.
25. The method of claim 13, further comprising the steps of:
in response to post-processing the test output, determining that the test output is the optimal solution;
collecting a live data set;
pre-processing the live data set to expand each of a plurality of live data points;
inputting the pre-processed live data set to the support vector machine for processing; and
receiving the live output of the trained support vector machine.
26. The method of claim 25, further comprising the step of post-processing the live output by interpreting the live output into a computationally derived alphanumerical classifier.
27. The method of claim 13, wherein the knowledge discovered from biological data comprises diagnosis or prognosis of a disease state.
28. The method of claim 13, wherein the knowledge discovered from biological data comprises efficacy of treatment of a disease state.
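The cleaning step recited above (determining that a training data point is dirty, then deleting, repairing or replacing it) can be illustrated with a short Python sketch. The rules used here are assumptions for the example only: a point counts as dirty when a coordinate is missing (`None` or NaN), a missing coordinate is repaired with the mean of the observed values in that column, and a point with more than `max_missing` missing coordinates is deleted. The function names are hypothetical.

```python
import math


def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_dirty(point):
    """A point is treated as dirty when any coordinate is missing (assumption)."""
    return any(is_missing(v) for v in point)


def clean(points, max_missing=1):
    """Repair dirty points with column means; delete points that cannot be repaired."""
    n_cols = len(points[0])
    means = []
    for j in range(n_cols):
        observed = [p[j] for p in points if not is_missing(p[j])]
        means.append(sum(observed) / len(observed) if observed else None)

    cleaned = []
    for p in points:
        if not is_dirty(p):
            cleaned.append(list(p))
            continue
        missing = [j for j, v in enumerate(p) if is_missing(v)]
        if len(missing) > max_missing or any(means[j] is None for j in missing):
            continue  # too damaged to repair: delete the point
        cleaned.append([means[j] if is_missing(v) else v for j, v in enumerate(p)])
    return cleaned


raw = [[1.0, 10.0], [None, 20.0], [3.0, float("nan")], [None, None]]
cleaned = clean(raw)
```

Here the second and third points are repaired with the column means 2.0 and 15.0, and the last point is deleted because both of its coordinates are missing.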
29. A system for enhancing knowledge discovered from biological data using a support vector machine comprising:
a storage device for storing a training data set and a test data set;
a processor for executing a support vector machine;
the processor further operable for:
collecting the training data set from the database, pre-processing the training data set to add meaning to each of a plurality of training data points, training the support vector machine using the pre-processed training data set;
in response to training the support vector machine, collecting the test data set from the database, pre-processing the test data set to expand each of a plurality of test data points, testing the trained support vector machine using the pre-processed test data set to generate a test output, and in response to receiving the test output of the trained support vector machine, post-processing the test output to determine if the test output is an optimal solution.
wherein the processor is further operable to store the training data set in the storage device prior to collection and pre-processing of the training data set and to store the test data set in the storage device prior to collection and pre-processing of the test data set.
31. The system of claim 29, further comprising a display device for displaying the post-processed test data.
32. The system of claim 29, wherein each training data point comprises a vector having one or more original coordinates; and
wherein pre-processing the training data set to add meaning to each training data point comprises adding dimensionality to each training data point by adding one or more new coordinates to the vector.
33. The system of claim 32, wherein the one or more new coordinates added to the vector are derived by applying a transformation to one or more of the original coordinates.
34. The system of claim 33, wherein the transformation is based on expert knowledge.
35. The system of claim 33, wherein the transformation is computationally derived.
36. The system of claim 33, wherein the training data set comprises a continuous variable; and
wherein the transformation comprises optimally categorizing the continuous variable of the training data set.
37. The system of claim 29, wherein the test output comprises a continuous variable; and
wherein post-processing the test output comprises optimally categorizing the continuous variable of the test data set.
38. The system of claim 29, wherein the knowledge to be discovered from the data relates to a regression or density estimation;
wherein a training output comprises a continuous variable; and
wherein the processor is further operable for post-processing the training output by optimally categorizing the continuous variable of the training output.
39. The system of claim 38, wherein optimally categorizing the training output comprises determining optimal cutoff points in the continuous variable based on entropy calculations.
40. The system of claim 29, wherein the processor is further operable for:
selecting a kernel for the support vector machine prior to training the support vector machine;
in response to post-processing the test output, determining that the test output is not the optimal solution;
adjusting the selection of the kernel; and
in response to adjusting the selection of the kernel, retraining and retesting the support vector machine.
41. The system of claim 40, wherein the selection of a kernel is based on prior performance or historical data and is dependent on the nature of the knowledge to be discovered from the data or the nature of the data.
42. The system of claim 29, wherein a live data set is stored in the storage device; and
wherein the processor is further operable for:
in response to post-processing the test output, determining that the test output is the optimal solution, collecting the live data set from the storage device;
pre-processing the live data set to expand each of a plurality of live data points;
inputting the pre-processed live data set to the support vector machine for processing to generate live output; and
receiving the live output of the trained support vector machine.
43. The system of claim 42, wherein the processor is further operable for post-processing the live output by interpreting the live output into a computationally derived alphanumerical classifier.
44. The system of claim 43, wherein the communications device is further operable to send the alphanumerical classifier to the remote source or another remote source.
45. The system of claim 29, wherein the knowledge discovered from biological data comprises diagnosis or prognosis of a disease state.
46. The system of claim 29, wherein the knowledge discovered from biological data comprises efficacy of treatment of a disease state.
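Claim 39 refers to determining optimal cutoff points in a continuous variable based on entropy calculations. The claims do not spell out the calculation, so the Python sketch below is only one plausible reading: it picks the single cutoff that minimizes the size-weighted Shannon entropy of the class labels on either side. The function names and the toy values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def best_cutoff(values, labels):
    """Return (cutoff, weighted_entropy) for the best single binary split."""
    pairs = sorted(zip(values, labels))
    best_score, best_cut = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cutoff can fall between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
    return best_cut, best_score


# Continuous outputs with known categories (toy data).
values = [8.0, 1.0, 7.0, 2.0, 9.0, 3.0]
labels = ["high", "low", "high", "low", "high", "low"]
cutoff, score = best_cutoff(values, labels)
```

Applying the same search recursively to each side of the cutoff would yield several cutoff points rather than one; that extension is left out to keep the sketch short.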
47. A method for diagnosing disease, comprising, using a learning machine comprising,
a) pre-processing a training data set derived from biological data to expand each of a plurality of training data points;
training the learning machine using the pre-processed training data set;
pre-processing a test data set derived from biological data to expand each of a plurality of test data points;
testing the trained learning machine using the pre-processed test data set to generate a test output; and
in response to receiving the test output of the trained learning machine, post-processing the test output to determine if the knowledge discovered from the pre-processed test data set is desirable.
Specification