Selection of features predictive of biological conditions using protein mass spectrographic data
First Claim
1. A method for identification distinguishing between different biological conditions using protein expression data contained in a plurality of mass spectra generated from mass spectrographic measurement of a plurality of samples from subjects having the different biological conditions, the method comprising:
- downloading the plurality of mass spectra into a computer system comprising a processor and a storage device, wherein the processor is programmed to perform the steps of:
aligning the plurality of spectra, comprising:
selecting a first spectrum of the plurality of spectra as a baseline example;
sliding each spectral peak of a second spectrum of the plurality of spectra one at a time along a plurality of peaks within the baseline example;
constructing a similarity measure for comparing pairs of spectra, wherein the similarity measure includes a scoring function for obtaining a similarity score between each spectral peak of the second spectrum and the peaks within the baseline example, the similarity score being examined according to the relationship $S(x_i - x_0) = \|x_i - x_0\|_2^2$, where $x_i$ and $x_0$ are feature vectors corresponding to peaks of an ith spectrum and the baseline spectrum, respectively;
offsetting the second spectrum relative to the baseline example according to the similarity score achieved for the second spectrum;
repeating the step of aligning the spectra for at least one additional spectrum to create a set of aligned spectra;
applying a feature selection algorithm to the set of aligned spectra to select a subset of spectral peaks that discriminate between the different biological conditions, wherein the feature selection algorithm is selected from SVM-recursive feature elimination and l0-norm minimization; and
training at least one support vector machine to discriminate between the plurality of different sample classes using the selected subset of spectral peaks, wherein the at least one support vector machine comprises a kernel;
processing the plurality of spectra using the at least one support vector machine;
generating a listing for display on a graphical display of at least one predictive feature within the plurality of spectra for distinguishing between the different biological conditions.
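The alignment steps recited in the claim (choose a baseline, slide a second spectrum against it, score each position with the squared Euclidean distance, then offset by the best-scoring position) can be illustrated with a minimal Python sketch. This is not the patented implementation. It assumes each spectrum is a list of intensities sampled on a shared m/z grid, it shifts the whole spectrum by an integer number of bins rather than sliding individual peaks, and it treats the lowest distance as the best score. The names `squared_l2`, `shift`, `best_offset`, `align`, and the `max_shift` parameter are hypothetical.

```python
def squared_l2(a, b):
    """Squared Euclidean distance ||a - b||_2^2 between equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def shift(spectrum, offset, fill=0.0):
    """Shift intensities by `offset` bins (positive moves peaks to the right).

    Vacated bins are padded with `fill`. Assumes abs(offset) < len(spectrum).
    """
    n = len(spectrum)
    if offset >= 0:
        return [fill] * offset + spectrum[: n - offset]
    return spectrum[-offset:] + [fill] * (-offset)


def best_offset(spectrum, baseline, max_shift=5):
    """Return the integer offset whose shifted spectrum is closest to the baseline.

    Ties are broken in favour of the smaller absolute shift (an assumption).
    """
    candidates = range(-max_shift, max_shift + 1)
    return min(
        candidates,
        key=lambda k: (squared_l2(shift(spectrum, k), baseline), abs(k)),
    )


def align(spectra, max_shift=5):
    """Align every spectrum to the first one, which serves as the baseline."""
    baseline = spectra[0]
    aligned = [baseline]
    for s in spectra[1:]:
        aligned.append(shift(s, best_offset(s, baseline, max_shift)))
    return aligned


if __name__ == "__main__":
    baseline = [0, 0, 1, 5, 1, 0, 0, 0, 0, 0]
    drifted = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0]  # same peak, two bins to the right
    print(best_offset(drifted, baseline))
```

A whole-spectrum shift keeps the sketch short. A per-peak variant would apply the same scoring function to a window of bins around each detected peak instead of to the full vector.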
Abstract
Support vector machines are used to classify data contained within a structured dataset such as a plurality of signals generated by a spectral analyzer. The signals are pre-processed to ensure alignment of peaks across the spectra. Similarity measures are constructed to provide a basis for comparison of pairs of samples of the signal. A support vector machine is trained to discriminate between different classes of the samples and to identify the most predictive features within the spectra. In a preferred embodiment, feature selection is performed to reduce the number of features that must be considered.
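As a rough illustration of the feature selection idea, the following Python sketch performs recursive feature elimination with a linear support vector machine: train, rank each feature by its squared weight, drop the lowest-ranked feature, and repeat. It is a simplified stand-in, not the method of the patent. The trainer is a plain full-batch subgradient descent on the regularised hinge loss rather than a kernel SVM solver, labels are assumed to be +1 or -1, and the names `train_linear_svm`, `svm_rfe`, `lam`, `epochs`, and `lr` are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b by full-batch subgradient descent on the
    L2-regularised hinge loss. Labels in y must be +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]  # gradient of the regulariser
        gb = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(d):
                    gw[j] -= yi * xi[j] / n
                gb -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b


def svm_rfe(X, y, n_keep):
    """Return the original column indices of the n_keep surviving features."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        # Ranking criterion: the feature with the smallest squared weight goes.
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        del active[worst]
    return active


if __name__ == "__main__":
    # Toy data: only the middle column differs between the two classes.
    X = [[0.1, 2.0, 0.5], [0.2, 2.0, 0.4], [0.1, -2.0, 0.5], [0.2, -2.0, 0.4]]
    y = [1, 1, -1, -1]
    print(svm_rfe(X, y, n_keep=1))
```

Removing one feature per round is the simplest schedule. With thousands of spectral bins, eliminating a fraction of the remaining features per round is a common way to cut the number of retraining passes.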
47 Citations
8 Claims
1. Independent claim, set out in full under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8.
Specification