Neural network system with N-gram term weighting method for molecular sequence classification and motif identification
Abstract
A method for rapid and sensitive protein family identification is disclosed. The new designs include an n-gram term weighting algorithm for extracting local motif patterns, an enhanced n-gram method for extracting residues of long-range correlation, and integrated neural networks for combining global and motif sequence information.
5 Claims
1. A method for training a neural network to predict membership in a family of linear sequences, comprising:

providing a training set of full length member sequences, a training set of full length non-member sequences, and a training set of family motif sequences;
deriving term weights for n-gram terms by dividing the number of occurrences of each n-gram term in the motif set by the number of occurrences in the full length member set;
deriving a set of global vectors for the full length member set and a set of global vectors for the full length non-member set using an n-gram method;
deriving a set of motif vectors for the full length member set and a set of motif vectors for the full length non-member set by multiplying each term in the global vector set by its term weighting factor;
providing a neural network with multiple output units to represent one family;
using the global vector of the member sequence set to train the positive full length output unit;
using the motif vector of the member set to train the positive motif output unit;
using the global vector of the non-member set to train the full length negative output unit; and
using the motif vector of the non-member set to train the negative motif output unit of said neural network.

(Dependent claims 2, 3 and 4 are not reproduced here.)
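The vector-derivation steps of claim 1 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the n-gram size (bigrams), the frequency normalization of the global vector, and all function names are assumptions the claim does not fix.

```python
from collections import Counter

def ngram_counts(seq, n=2):
    """Count overlapping n-gram terms in one sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

def term_weights(member_seqs, motif_seqs, n=2):
    """Weight each n-gram term by its motif-set count divided by its
    full-length member-set count; terms absent from the motif set get 0."""
    motif_total, member_total = Counter(), Counter()
    for s in motif_seqs:
        motif_total += ngram_counts(s, n)
    for s in member_seqs:
        member_total += ngram_counts(s, n)
    return {k: motif_total[k] / member_total[k] for k in member_total}

def global_vector(seq, terms, n=2):
    """Global vector: frequency of each n-gram term over a fixed term order
    (the normalization is an assumption; the claim only names 'an n-gram method')."""
    counts = ngram_counts(seq, n)
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in terms]

def motif_vector(global_vec, terms, weights):
    """Motif vector: each global-vector component scaled by its term weight."""
    return [v * weights.get(t, 0.0) for v, t in zip(global_vec, terms)]
```

For example, with member set `["GKSTGKST", "AAGKST"]` and motif set `["GKST"]`, the bigrams GK, KS and ST each receive weight 1/3, while bigrams absent from the motif (TG, AA, AG) receive weight 0. The resulting global and motif vectors would then be used to train the four output units named in the claim.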
5. A method of obtaining term weighted n-grams for use in training a neural network comprising:
providing a training set of full length member sequences and a training set of family motif sequences; and
deriving term weights by applying the formula

    W_k = Σ_i M_ik / Σ_i F_ik

to said motif set and said full length set, where W_k is the weight factor for the k-th n-gram term in the input vector, and F_ik and M_ik are the total counts of the k-th n-gram term in the i-th sequence of the full-length sequence set and the motif set, respectively.
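Read as code, the claim-5 formula sums the per-sequence counts over i before dividing. A direct sketch, assuming each sequence's counts are held in a dict mapping n-gram terms to totals (the representation and function name are illustrative):

```python
def term_weight(motif_counts, full_counts, k):
    """W_k = (sum over sequences i of M_ik) / (sum over sequences i of F_ik),
    where each list element maps n-gram terms to that sequence's counts."""
    numerator = sum(seq_counts.get(k, 0) for seq_counts in motif_counts)
    denominator = sum(seq_counts.get(k, 0) for seq_counts in full_counts)
    return numerator / denominator if denominator else 0.0
```

For instance, motif-set counts of 2 and 1 for a term against full-length counts of 4 and 2 give W_k = 3/6 = 0.5.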
Specification