Methods of identifying biological patterns using multiple data sets

US 6,882,990 B1
Filed: 08/07/2000
Issued: 04/19/2005
Est. Priority Date: 05/01/1999
Status: Expired due to Fees

Abstract

Systems and methods for enhancing knowledge discovery from data using multiple learning machines in general and multiple support vector machines in particular. Training data for a learning machine is pre-processed in order to add meaning thereto. Multiple support vector machines, each comprising distinct kernels, are trained with the pre-processed training data and are tested with test data that is pre-processed in the same manner. The test outputs from multiple support vector machines are compared in order to determine which of the test outputs, if any, represents an optimal solution. Selection of one or more kernels may be adjusted and one or more support vector machines may be retrained and retested. Optimal solutions based on distinct input data sets may be combined to form a new input data set to be input into one or more additional support vector machines. The methods, systems and devices of the present invention comprise use of Support Vector Machines for the identification of patterns that are important for medical diagnosis, prognosis and treatment. Such patterns may be found in many different datasets. The present invention also comprises methods and compositions for the treatment and diagnosis of medical conditions.
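The train, test and compare loop described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: it swaps in a kernel perceptron as a stand-in for a support vector machine so that it runs with the Python standard library alone, and the toy XOR-style data, the three kernels, and all function names (`train`, `predict`, `error_on`) are assumptions made for the example.

```python
import math

# Three candidate kernels (illustrative choices, not taken from the patent).
def linear(a, b):
    return sum(x * y for x, y in zip(a, b))

def poly2(a, b):
    return (1.0 + linear(a, b)) ** 2

def rbf(a, b, gamma=2.0):
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def train(X, y, kernel, epochs=50):
    """Kernel perceptron (stand-in for an SVM).

    alpha[i] counts how often training point i was misclassified.
    """
    alpha = [0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i in range(len(X)):
            score = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(len(X)))
            if y[i] * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alpha

def predict(X, y, alpha, kernel, x):
    score = sum(a * t * kernel(xj, x) for a, t, xj in zip(alpha, y, X))
    return 1 if score > 0 else -1

def error_on(X, y, alpha, kernel, X_test, y_test):
    """Fraction of test points that the trained machine misclassifies."""
    wrong = sum(predict(X, y, alpha, kernel, x) != t for x, t in zip(X_test, y_test))
    return wrong / len(X_test)

# Toy XOR-style data: not separable by a linear kernel without a bias term.
X_train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y_train = [-1, 1, 1, -1]
X_test = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]
y_test = [-1, 1, 1, -1]

# Train one machine per kernel, test each, then compare the test errors.
kernels = {"linear": linear, "poly2": poly2, "rbf": rbf}
errors = {}
for name, kernel in kernels.items():
    alpha = train(X_train, y_train, kernel)
    errors[name] = error_on(X_train, y_train, alpha, kernel, X_test, y_test)

best_kernel = min(errors, key=errors.get)
```

On this toy data the linear machine cannot reach zero test error, while the two nonlinear kernels can, so the comparison step selects a nonlinear kernel. In a real setting the stand-in learner would be replaced by an actual support vector machine implementation.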

133 Citations

17 Claims

1. A method for enhancing knowledge discovery using multiple support vector machines comprising:
- pre-processing a first training biological data set and a second training biological data set in order to add dimensionality to each of a plurality of training biological data points;
  
  training one or more first support vector machines using the first pre-processed training biological data set, each of the first support vector machines comprising different kernels;
  
  training one or more second support vector machines using the second pre-processed training data set, each of the second support vector machines comprising different kernels;
  
  pre-processing a first test biological data set in the same manner as was the first training biological data sets and pre-processing a second test biological data set in the same manner as was the second training biological data set;
  
  testing each of the first trained support vector machines using the first pre-processed test biological data set and testing each of the second trained support vector machines using the second pre-processed test biological data set;
  
  in response to receiving a first test output from each of the first trained support vector machines, comparing each of the first test outputs with each other to determine which if any of the first test outputs is a first optimal solution;
  
  in response to receiving a second test output from each of the second trained support vector machines, comparing each of the second test outputs with each other to determine which if any of the second test outputs is a second optimal solution;
  
  combining the first optimal solution with the second optimal solution to create a new input data set to be input into one or more additional support vector machines.
  - 2. The method of claim 1, wherein pre-processing the first training biological data set and the second training biological data set further comprises:
    - determining that at least one of the training biological data points is dirty; and
      
      in response to determining that the training biological data point is dirty, cleaning the dirty training biological data point.
  - 3. The method of claim 2, wherein cleaning the dirty training biological data point comprises deleting, repairing or replacing the data point.
  - 4. The method of claim 1, wherein each training biological data point comprises a vector having one or more original coordinates;
    - and wherein pre-processing the training biological data set comprises adding one or more new coordinates to the vector.
  - 5. The method of claim 4, wherein the one or more new coordinates added to the vector are derived by applying a transformation to one or more of the original coordinates.
  - 6. The method of claim 5, wherein the transformation is based on expert knowledge.
  - 7. The method of claim 5, wherein the transformation is computationally derived.
  - 8. The method of claim 1, wherein the training data set comprises a continuous variable;
    - and wherein the transformation comprises optimally categorizing the continuous variable of the training data set.
  - 9. The method of claim 1, wherein comparing each of the first test outputs with each other and comparing each of the second test outputs with each other comprises:
    - post-processing each of the test outputs by interpreting each of the test outputs into a common format;
      
      comparing each of the first post-processed test outputs with each other to determine which of the first test outputs represents a first lowest global minimum error; and
      
      comparing each of the second post-processed test outputs with each other to determine which of the second test outputs represents a second lowest global minimum error.
  - 10. The method of claim 1, wherein the knowledge to be discovered from the data relates to a regression or density estimation;
    - wherein each support vector machine produces a training output comprising a continuous variable; and
      
      wherein the method further comprises the step of post-processing each of the training outputs by optimally categorizing the training output to derive cutoff points in the continuous variable.
  - 11. The method of claim 1, further comprising the steps of:
    - in response to comparing each of the test outputs with each other, determining that none of the test outputs is the optimal solution;
      
      adjusting the different kernels of one or more of the plurality of support vector machines; and
      
      in response to adjusting the selection of the different kernels, retraining and retesting each of the plurality of support vector machines.
  - 12. The method of claim 11, wherein adjusting the different kernels is performed based on prior performance or historical data and is dependent on the nature of the knowledge to be discovered from the data or the nature of the data.
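The pre-processing steps recited in claims 2 through 5 (cleaning dirty data points, then adding new coordinates derived from the original ones) might look roughly like the sketch below. Everything specific here is an assumption for illustration: treating a missing or non-finite value as "dirty", repairing it with the training column mean, and the two derived coordinates (a product and a difference) are hypothetical choices, not details taken from the patent.

```python
import math

def is_missing(value):
    """Hypothetical dirtiness test: a coordinate is dirty if absent or non-finite."""
    return value is None or not math.isfinite(value)

def column_means(points):
    """Mean of the valid values in each column (assumes each column has at least one)."""
    means = []
    for d in range(len(points[0])):
        valid = [p[d] for p in points if not is_missing(p[d])]
        means.append(sum(valid) / len(valid))
    return means

def repair(point, means):
    """Clean a dirty point by replacing each dirty coordinate with the column mean."""
    return [m if is_missing(v) else v for v, m in zip(point, means)]

def add_coordinates(point):
    """Add dimensionality: append coordinates derived from the original ones."""
    x1, x2 = point[0], point[1]
    return list(point) + [x1 * x2, x1 - x2]

def preprocess(points, means):
    return [add_coordinates(repair(p, means)) for p in points]

# Toy training and test sets with two original coordinates per point.
train_points = [[1.0, 2.0], [3.0, None], [5.0, 6.0]]
test_points = [[float("nan"), 1.0]]

# The means come from the training set and are reused for the test set,
# so that the test data is pre-processed in the same manner.
means = column_means(train_points)
train_ready = preprocess(train_points, means)
test_ready = preprocess(test_points, means)
```

Reusing the training-set means for the test set is one way to read the requirement that test data be pre-processed "in the same manner" as the training data.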

13. A computer-readable medium with computer-executable instructions for performing a method for:
- pre-processing a first training biological data set and a second training biological data set in order to add dimensionality to each of a plurality of training biological data points;
  
  training one or more first support vector machines using the first pre-processed training biological data set, each of the first support vector machines comprising different kernels;
  
  training one or more second support vector machines using the second pre-processed training data set, each of the second support vector machines comprising different kernels;
  
  pre-processing a first test biological data set in the same manner as was the first training biological data sets and pre-processing a second test biological data set in the same manner as was the second training biological data set;
  
  testing each of the first trained support vector machines using the first pre-processed test biological data set and testing each of the second trained support vector machines using the second pre-processed test biological data set;
  
  in response to receiving a first test output from each of the first trained support vector machines, comparing each of the first test outputs with each other to determine which if any of the first test outputs is a first optimal solution;
  
  in response to receiving a second test output from each of the second trained support vector machines, comparing each of the second test outputs with each other to determine which if any of the second test outputs is a second optimal solution;
  
  combining the first optimal solution with the second optimal solution to create a new input data set to be input into one or more additional support vector machines.
  - 14. The computer-readable medium with computer-executable instructions of claim 13, wherein each training biological data point comprises a vector having one or more original coordinates;
    - and wherein pre-processing the training biological data set comprises adding one or more new coordinates to the vector.
  - 15. The computer-readable medium with computer-executable instructions of claim 13, wherein the training data set comprises a continuous variable;
    - and wherein the transformation comprises optimally categorizing the continuous variable of the training data set.
  - 16. The computer-readable medium with computer-executable instructions of claim 13, wherein comparing each of the first test outputs with each other and comparing each of the second test outputs with each other comprises:
    - post-processing each of the test outputs by interpreting each of the test outputs into a common format;
      
      comparing each of the first post-processed test outputs with each other to determine which of the first test outputs represents a first lowest global minimum error; and
      
      comparing each of the second post-processed test outputs with each other to determine which of the second test outputs represents a second lowest global minimum error.
  - 17. The computer-readable medium with computer-executable instructions of claim 13, further comprising the steps of:
    - in response to comparing each of the test outputs with each other, determining that none of the test outputs is the optimal solution;
      
      adjusting the different kernels of one or more of the plurality of support vector machines; and
      
      in response to adjusting the selection of the different kernels, retraining and retesting each of the plurality of support vector machines.
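The post-processing, comparison and combination steps recited in the claims above can be sketched as follows. This is only an illustrative reading under stated assumptions: the raw test outputs are made-up decision values, the "common format" is taken to be class labels of +1 or -1, the error measure is a plain misclassification rate, and the helper names (`to_common_format`, `select_optimal`, `combine`) are hypothetical.

```python
def to_common_format(raw_outputs, threshold=0.0):
    """Interpret raw decision values as class labels (+1 or -1)."""
    return [1 if v > threshold else -1 for v in raw_outputs]

def error_rate(predicted, labels):
    return sum(p != t for p, t in zip(predicted, labels)) / len(labels)

def select_optimal(outputs_by_machine, labels):
    """Return (machine name, error) for the test output with the lowest error."""
    scored = {
        name: error_rate(to_common_format(raw), labels)
        for name, raw in outputs_by_machine.items()
    }
    best = min(scored, key=scored.get)
    return best, scored[best]

def combine(first_output, second_output):
    """Pair the two selected outputs point by point to form new input vectors."""
    return [[a, b] for a, b in zip(first_output, second_output)]

labels = [1, -1, 1, -1]

# Made-up raw test outputs from machines trained on two different data sets.
first_outputs = {
    "linear": [0.2, -0.1, 0.4, 0.3],
    "rbf": [0.9, -0.8, 0.7, -0.6],
}
second_outputs = {
    "poly": [-0.5, -0.2, 0.1, -0.9],
    "rbf": [-0.3, 0.4, 0.6, 0.2],
}

first_best, first_err = select_optimal(first_outputs, labels)
second_best, second_err = select_optimal(second_outputs, labels)

# New input data set for one or more additional machines.
new_inputs = combine(first_outputs[first_best], second_outputs[second_best])
```

Here the combined vectors carry the raw decision values of the two selected machines; passing the post-processed labels instead would be an equally plausible design, and a retraining loop with adjusted kernels would apply when no output is judged acceptable.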

Current Assignee: Curtis Anderson, Health Discovery Corporation, James Roberts, Joe Mckenzie, Jules B. Paderewski, Julian N. Stern, Memorial Health Systems Incorporated, Timothy P. O'Hayer
Original Assignee: BIOwulf Technologies, LLC
Inventors: Weston, Jason; Barnhill, Stephen; Guyon, Isabelle
Primary Examiner(s): Patel, Ramesh
Assistant Examiner(s): Holmes, Michael B.
Application Number: US09/633,410
Time in Patent Office: 1,716 Days
Field of Search: 706/12, 706/16
US Class Current: 706/16
CPC Class Codes

G06F 18/211   Selection of the most signi...

G06F 18/2113   by ranking or filtering the...

G06F 18/2115   by evaluating different sub...

G06F 18/214   Generating training pattern...

G06F 18/2411   based on the proximity to a...

G06N 20/00   Machine learning

G06N 20/10   using kernel methods, e.g. ...

G06Q 10/0637   Strategic management or ana...

G06Q 10/10   Office automation; Time man...

G06Q 20/10   specially adapted for elect...

G06Q 40/06   Asset management; Financial...

G06T 7/0012   Biomedical image inspection

G06V 10/764   using classification, e.g. ...

G06V 10/771   Feature selection, e.g. sel...

G06V 10/774   Generating sets of training...

G16B 25/00   ICT specially adapted for h...

G16B 40/00   ICT specially adapted for b...

G16B 40/20   Supervised data analysis

G16B 40/30   Unsupervised data analysis

G16H 10/40   for data related to laborat...

Y02A 90/10 : Information and communicati...
