Automated machine-learning classification using feature scaling

US 8,885,928 B2
Filed: 10/25/2006
Issued: 11/11/2014
Est. Priority Date: 10/25/2006
Status: Expired due to Fees

Abstract

Provided are systems, methods and techniques for machine-learning classification. In one representative embodiment, an item having values for a plurality of different features in a feature set is obtained, together with scores for the different features. The score for a given feature is a measure of prediction ability for that feature and was calculated as a function of a plurality of different occurrence metrics of the feature. The values for the features are scaled according to the scores for the features, and the item is classified by inputting the adjusted feature set values for the item into a previously trained classifier.

18 Claims

1. A method of automated machine-learning classification, comprising:
- establishing, within a computer, an original feature set, each feature of the original feature set having a predictive value, the predictive value of some features being uncertain for characterizing expected input items during classification thereof;
  
  selecting with the computer a feature set, the feature set being a subset of the original feature set;
  
  obtaining to the computer a number of training items having values for a plurality of different features in the feature set;
  
  calculating with the computer scores for the different features of the feature set using a scoring technique, the score for a given feature being a measure of prediction ability for the given feature and calculated as $S = |aF^{-1}(tpr) - bF^{-1}(fpr)|$, where S is the score, tpr is the true positive rate of the given feature equal to a number of positive training cases containing a subject feature divided by a number of positive training cases, fpr is the false positive rate of the given feature equal to a number of negative training cases containing the subject feature divided by a number of negative training cases, |*| is an absolute value, $F^{-1}(*)$ is an inverse of an assumed probability distribution function, and a and b are constants;
  
  scaling the values for the features of the feature set with the computer according to the scores for said features as adjusted feature values;
  
  generating a classifier with the computer;
  
  training the classifier using the adjusted feature values for the features of the feature set;
  
  scaling the values for the features in the feature set of an input item with the computer according to the scores as adjusted feature values of the input item; and
  
  classifying an input item using the computer and the adjusted feature values for the input item into the previously trained classifier.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8)
- - 2. A method according to claim 1, wherein the given score is a measure of separation between the plurality of different occurrence metrics, relative to an assumed probability distribution.
  - 3. A method according to claim 1, wherein the plurality of different occurrence metrics include a true positive rate for the given feature and a false positive rate for the given feature.
  - 4. A method according to claim 3, wherein the score for the given feature is a measure of separation between the true positive rate and the false positive rate.
  - 5. A method according to claim 4, wherein the measure of separation between the true positive rate and the false positive rate is calculated relative to an assumed probability distribution.
  - 6. A method according to claim 5, wherein the assumed probability distribution comprises a normal cumulative probability distribution function.
  - 7. A method according to claim 1, wherein the values for the features are scaled such that ranges of values for the different features are proportionate to the scores corresponding to said different features.
  - 8. A method according to claim 1, further comprising eliminating at least one feature from the original feature set in selecting the feature set using a second scoring technique different than the scoring technique used for calculating scores for the different features.
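
The scoring and scaling steps of claims 1, 6 and 7 can be illustrated with a minimal Python sketch. It assumes the normal cumulative distribution named in claim 6 for F, takes a = b = 1, and treats features as binary (present or absent). The clipping constant `EPS` and the function names `feature_score` and `scale_values` are hypothetical choices for this illustration and do not come from the patent text; clipping is only there to keep the inverse CDF finite when a rate is exactly 0 or 1.

```python
from statistics import NormalDist

# Assumption: rates are clipped away from 0 and 1 so F^-1 stays finite.
EPS = 0.0005


def feature_score(pos_with, pos_total, neg_with, neg_total, a=1.0, b=1.0, eps=EPS):
    """Return S = |a*F^-1(tpr) - b*F^-1(fpr)| with F the standard normal CDF."""
    inv = NormalDist().inv_cdf
    # tpr: positive training cases containing the feature / all positive cases
    tpr = min(max(pos_with / pos_total, eps), 1 - eps)
    # fpr: negative training cases containing the feature / all negative cases
    fpr = min(max(neg_with / neg_total, eps), 1 - eps)
    return abs(a * inv(tpr) - b * inv(fpr))


def scale_values(values, scores):
    """Multiply each feature value by its score to get adjusted feature values."""
    return [v * s for v, s in zip(values, scores)]
```

With binary feature values, multiplying by the score gives each feature the range [0, S], so the ranges are proportionate to the scores in the sense of claim 7. A feature that appears equally often in both classes scores 0 and is effectively switched off.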

9. A method of automated machine learning classification, comprising:
- obtaining, to a first pre-processing portion of a computer, a training item having values for a plurality of different features in a feature set;
  
  calculating with a scoring technique implemented by the first pre-processing portion of the computer scores for the different features, the score for a given feature being calculated as $S = |aF^{-1}(tpr) - bF^{-1}(fpr)|$, where S is the score, tpr is the true positive rate of the given feature equal to a number of positive training cases containing a subject feature divided by a number of positive training cases, fpr is the false positive rate of the given feature equal to a number of negative training cases containing the subject feature divided by a number of negative training cases, |*| is an absolute value, $F^{-1}(*)$ is an inverse of an assumed probability distribution function, and a and b are constants;
  
  scaling the values for the features with the first pre-processing portion of the computer according to the scores for said features, thereby obtaining adjusted feature set values for the training item;
  
  training a supervised machine-learning classifier using the adjusted feature set values from the first pre-processing portion of the computer;
  
  obtaining to a second pre-processing portion of a computer an unlabeled item having values for the plurality of different features in the feature set;
  
  calculating with the scoring technique implemented by the second pre-processing portion of the computer, further scores for the different features, the further score for a given feature being calculated as S;
  
  scaling the adjusted feature set values using the second pre-processing portion of the computer according to the further scores for said features, thereby obtaining modified feature set values for the unlabeled item;
  
  scaling the values for the features in the feature set of an input item with the computer according to the scores as adjusted feature values of the input item; and
  
  classifying the unlabeled item by inputting the modified feature set values into the supervised machine-learning classifier.
- View Dependent Claims (10, 11, 12, 13, 14)
- - 10. A method according to claim 9, wherein the score and the further score for the given feature were calculated as a function of a plurality of different occurrence metrics pertaining to the given feature.
  - 11. A method according to claim 10, wherein the occurrence metrics include a count of proper classification based on the given feature and a count of improper classification based on the given feature.
  - 12. A method according to claim 10, wherein the given score and the given further score are a measure of separation between the plurality of occurrence metrics, relative to an assumed probability distribution.
  - 13. A method according to claim 9, wherein the supervised machine-learning classifier is a Support Vector Machine.
  - 14. A method according to claim 9, further comprising eliminating at least one feature from the feature set prior to classifying the unlabeled item.
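
The overall flow (score features from labeled training items, scale, train, then scale and classify an unlabeled item) can be sketched end to end as follows. This is an illustration under stated assumptions, not the claimed implementation: a nearest-centroid classifier stands in for the Support Vector Machine of claim 13 so the example needs only the standard library, the unlabeled item reuses the training-derived scores (a simplification of the two pre-processing portions recited in claim 9), F is taken to be the normal CDF with a = b = 1, and all function names and the toy data are hypothetical.

```python
from statistics import NormalDist


def bns_scores(items, labels, eps=0.0005):
    """Score each binary feature as |F^-1(tpr) - F^-1(fpr)| from labeled items."""
    inv = NormalDist().inv_cdf
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = []
    for j in range(len(items[0])):
        tp = sum(x[j] for x, y in zip(items, labels) if y)
        fp = sum(x[j] for x, y in zip(items, labels) if not y)
        # Assumption: clip rates so the inverse CDF is finite.
        tpr = min(max(tp / n_pos, eps), 1 - eps)
        fpr = min(max(fp / n_neg, eps), 1 - eps)
        scores.append(abs(inv(tpr) - inv(fpr)))
    return scores


def scale(item, scores):
    """Adjusted feature values: each value multiplied by its feature's score."""
    return [v * s for v, s in zip(item, scores)]


def train_centroids(scaled_items, labels):
    """Stand-in classifier (not an SVM): one mean vector per class."""
    centroids = {}
    for cls in (True, False):
        rows = [x for x, y in zip(scaled_items, labels) if y == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def classify(scaled_item, centroids):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((u - v) ** 2 for u, v in zip(scaled_item, c))
    return min(centroids, key=lambda cls: dist2(centroids[cls]))


# Toy data: feature 0 separates the classes; features 1 and 2 do not.
train_items = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
train_labels = [True, True, True, False, False, False]

scores = bns_scores(train_items, train_labels)
centroids = train_centroids([scale(x, scores) for x in train_items], train_labels)
prediction = classify(scale([1, 0, 0], scores), centroids)
```

Because a distance-based classifier is sensitive to the range of each feature, scaling by score lets the predictive feature dominate the distance while the uninformative ones contribute nothing. The same reasoning is why scaling matters for kernel or margin-based classifiers such as an SVM.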

15. A non-transitory computer-readable medium storing computer-executable process steps for machine-learning classification, said process steps comprising:
- establishing an original feature set, each feature of the original feature set having a predictive value, the predictive value of some features being uncertain for characterizing expected input items during classification thereof;
  
  selecting a feature set, the feature set being a subset of the original feature set;
  
  obtaining a number of training items having values for a plurality of different features in the feature set;
  
  calculating with the computer scores for the different features of the feature set using a scoring technique, the score for a given feature being a measure of prediction ability for the given feature and calculated as $S = |aF^{-1}(tpr) - bF^{-1}(fpr)|$, where S is the score, tpr is the true positive rate of the given feature equal to a number of positive training cases containing a subject feature divided by a number of positive training cases, fpr is the false positive rate of the given feature equal to a number of negative training cases containing the subject feature divided by a number of negative training cases, |*| is an absolute value, $F^{-1}(*)$ is an inverse of an assumed probability distribution function, and a and b are constants;
  
  scaling the values for the features of the feature set according to the scores for said features as adjusted feature values;
  
  generating a classifier;
  
  training the classifier using the adjusted feature values of the feature set;
  
  scaling the values for the features in the feature set of an input item with the computer according to the scores as adjusted feature values of the input item; and
  
  classifying an input item using the adjusted feature values for the input item into the previously trained classifier.
- View Dependent Claims (16, 17, 18)
- - 16. A non-transitory computer-readable medium according to claim 15, wherein the given score is a measure of separation between the plurality of occurrence metrics, relative to an assumed probability distribution.
  - 17. A non-transitory computer-readable medium according to claim 15, wherein the plurality of occurrence metrics include a true positive rate for the given feature and a false positive rate for the given feature.
  - 18. A non-transitory computer-readable medium according to claim 15, wherein at least one feature is eliminated from the original feature set in selecting the feature set using a second scoring technique different than the scoring technique used for calculating scores for the different features.
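
Claims 8 and 18 recite eliminating features with a second scoring technique that differs from the one used for scaling, but the claims do not say which technique. The sketch below assumes, purely for illustration, that the second score is the absolute difference |tpr - fpr| and that the top `keep` features survive; `select_features` and `project` are hypothetical names.

```python
def select_features(items, labels, keep):
    """Return sorted indices of the `keep` features with the highest |tpr - fpr|.

    Assumption: |tpr - fpr| is used here only as an example of a second
    scoring technique that differs from the scaling score.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos

    def second_score(j):
        tp = sum(x[j] for x, y in zip(items, labels) if y)
        fp = sum(x[j] for x, y in zip(items, labels) if not y)
        return abs(tp / n_pos - fp / n_neg)

    ranked = sorted(range(len(items[0])), key=second_score, reverse=True)
    return sorted(ranked[:keep])


def project(item, kept):
    """Drop eliminated features, keeping only the selected indices."""
    return [item[j] for j in kept]


# Toy data: feature 0 is most separating, feature 2 somewhat, feature 1 not at all.
items = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = [True, True, True, False, False, False]
kept = select_features(items, labels, keep=2)
reduced = [project(x, kept) for x in items]
```

The surviving features would then be scored and scaled with the inverse-CDF formula before training, so selection decides which features exist and scaling decides how much each one counts.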

Current Assignee: Micro Focus LLC (Open Text Corporation)
Original Assignee: Hewlett-Packard Development Company, L.P. (HP Inc.)
Inventors: Forman, George Henry
Primary Examiner(s): Park, Edward
Application Number: US11/552,968
Publication Number: US 20080101689A1
Time in Patent Office: 2,939 Days
Field of Search: 382/155, 382/161, 706/12, 706/44, 700/47, 700/48
US Class Current: 382/159
CPC Class Codes: G06F 18/2113 by ranking or filtering the...
