Modifying an unreliable training set for supervised classification
Abstract
An unreliable training set is modified to provide for a reliable training set to be used in supervised classification. The training set is modified by determining which data of the set are incorrect and reconstructing those incorrect data. The reconstruction includes modifying the labels associated with the data to provide for correct labels. The modification can be performed iteratively.
25 Claims
-
1. A method of modifying a training set for use in data classification, said method comprising:
-
determining at least one datum of said training set is incorrect;
reconstructing said at least one datum of said training set to provide a modified training set; and
wherein said reconstructing comprises modifying a label associated with said at least one datum to provide a correct label.
-
-
2. A method of modifying a training set for use in data classification, said method comprising:
-
determining at least one datum of said training set is incorrect;
reconstructing said at least one datum of said training set to provide a modified training set;
wherein said training set comprises a plurality of data, each with a corresponding label, and wherein said determining comprises;
dividing said plurality of data into a plurality of groups; and
applying one or more rules to at least a portion of the data of at least one group of said plurality of groups to determine if any of said corresponding labels of said at least a portion of the data is incorrect; and
wherein said applying comprises applying one or more rules to the data of each group of said plurality of groups to determine if any corresponding labels is incorrect.
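The group-wise determining step recited above can be illustrated with a minimal sketch. It is not taken from the patent: the round-robin split, the nearest-centroid rule, and the names `split_into_groups`, `centroid_rule`, and `find_suspect_labels` are all assumptions chosen for illustration. Each group is checked against a rule derived from the remaining groups, and any datum whose label disagrees with the rule is flagged.

```python
from collections import defaultdict

def split_into_groups(data, k):
    """Deal the (features, label) pairs round-robin into k groups (assumed scheme)."""
    return [data[i::k] for i in range(k)]

def centroid_rule(training):
    """Hypothetical rule: assign the label of the nearest class centroid."""
    sums, counts = {}, defaultdict(int)
    for x, y in training:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def rule(x):
        # Squared Euclidean distance to each centroid; the smallest wins.
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))
    return rule

def find_suspect_labels(data, k=3):
    """Apply a rule to every group; return (features, given, predicted) for mismatches."""
    groups = split_into_groups(data, k)
    suspects = []
    for i, held_out in enumerate(groups):
        rest = [d for j, g in enumerate(groups) if j != i for d in g]
        rule = centroid_rule(rest)
        for x, y in held_out:
            predicted = rule(x)
            if predicted != y:
                suspects.append((x, y, predicted))
    return suspects

# Toy data: the last point sits in the "a" cluster but carries label "b".
data = [((0.0, 0.0), "a"), ((1.0, 0.0), "a"), ((0.0, 1.0), "a"),
        ((10.0, 10.0), "b"), ((9.0, 10.0), "b"), ((10.0, 9.0), "b"),
        ((0.5, 0.5), "b")]
suspects = find_suspect_labels(data, k=3)
```

On this toy data only the last point is flagged. The claim does not prescribe how the groups are formed or which rules are applied; any grouping and rule set could be substituted.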
-
-
3. A method of modifying a training set for use in data classification, said method comprising:
-
determining at least one datum of said training set is incorrect;
reconstructing said at least one datum of the training set to provide a modified training set;
further comprising determining whether said modified training set is acceptable and repeating said determining and said reconstructing when said modified training set is unacceptable;
wherein said determining whether said modified training set is acceptable comprises;
creating a set of rules based on said modified training set;
using said set of rules to instantiate a classifier related to the modified training set;
comparing results of the instantiation with one or more predetermined conditions to determine if the modified training set is acceptable; and
wherein said creating and using are based on a progressive classification technique.
-
-
4. A method of modifying a training set for use in data classification, said method comprising:
-
determining at least one datum of said training set is incorrect;
reconstructing said at least one datum of the training set to provide a modified training set;
further comprising determining whether said modified training set is acceptable and repeating said determining and said reconstructing when said modified training set is unacceptable;
wherein said determining whether said modified training set is acceptable comprises;
creating a set of rules based on said modified training set;
using said set of rules to instantiate a classifier related to the modified training set;
comparing results of the instantiation with one or more predetermined conditions to determine if the modified training set is acceptable; and
wherein said creating and using are based on a genetic classification technique.
-
-
5. A method of modifying a training set for use in data classification, said training set comprising a plurality of data, each with a corresponding label, said method comprising:
-
determining at least one datum of said training set is incorrect, said determining comprising dividing said plurality of data into a plurality of groups and applying one or more rules to at least a portion of the data of at least one group of said plurality of groups to determine if any of said corresponding labels of said at least a portion of the data is incorrect; and
reconstructing said at least one datum of said training set to provide a modified training set, wherein said reconstructing comprises;
constructing a contingency table for the data of said plurality of groups;
creating a histogram from said contingency table;
identifying any regions of low confidence from said histogram; and
modifying labels associated with data identified to be within a region of low confidence.
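One possible reading of the reconstruction steps above is sketched below. The claim does not define how the histogram is computed or what counts as a region of low confidence, so everything here is an assumption: the contingency table counts (given label, rule-predicted label) pairs, the histogram is each cell's share of its given-label row, and a low-confidence region is an off-diagonal cell whose share falls below a hypothetical `threshold`. Data in such cells are relabeled rather than discarded.

```python
from collections import Counter

def contingency_table(given, predicted):
    """Count how often each (given label, predicted label) pair occurs."""
    return Counter(zip(given, predicted))

def confidence_histogram(table):
    """Assumed histogram: each cell's count as a fraction of its given-label row."""
    row_totals = Counter()
    for (g, _), n in table.items():
        row_totals[g] += n
    return {(g, p): n / row_totals[g] for (g, p), n in table.items()}

def relabel_low_confidence(given, predicted, threshold=0.3):
    """Replace the given label where the datum falls in a sparse off-diagonal cell."""
    hist = confidence_histogram(contingency_table(given, predicted))
    return [p if g != p and hist[(g, p)] < threshold else g
            for g, p in zip(given, predicted)]

# Toy example: one "a" datum is predicted "b" by the (hypothetical) rules.
given     = ["a", "a", "a", "a", "b", "b", "b", "b"]
predicted = ["a", "a", "a", "b", "b", "b", "b", "b"]
hist = confidence_histogram(contingency_table(given, predicted))
new_labels = relabel_low_confidence(given, predicted, threshold=0.3)
```

Here the cell (given "a", predicted "b") holds a quarter of the "a" row, which is below the assumed threshold of 0.3, so that single datum is relabeled "b" and all other labels are kept.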
-
-
6. A method of modifying a training set for use in data classification, said training set comprising a plurality of n-dimensional feature vectors, each feature vector having an associated label, said method comprising:
-
determining at least one datum of said training set is incorrect, said at least one datum comprising at least one of a feature vector or its associated label; and
reconstructing without discarding said at least one datum of said training set to provide a modified training set for use in data classification.
Dependent claims: 7, 8, 9, 10
dividing said plurality of data into a plurality of groups; and
applying one or more rules to at least a portion of the data of at least one group of said plurality of groups to determine if any of said corresponding labels of said at least a portion of the data is incorrect.
-
-
9. The method of claim 6, further comprising determining whether said modified training set is acceptable and repeating said determining and said reconstructing when said modified training set is unacceptable.
-
10. The method of claim 9, wherein said determining whether said modified training set is acceptable comprises:
-
creating a set of rules based on said modified training set;
using said set of rules to instantiate a classifier related to the modified training set; and
comparing results of the instantiation with one or more predetermined conditions to determine if the modified training set is acceptable.
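The acceptability check and the repeat-until-acceptable loop of claims 9 and 10 might look like the following sketch. It is illustrative only: the one-dimensional features, the per-label mean used as the "set of rules", the agreement fraction used as the "result", and the `min_agreement` and `max_rounds` parameters are assumptions rather than details from the claims.

```python
def fit_centroids(data):
    """Assumed 'set of rules': the mean feature value for each label."""
    totals = {}
    for x, y in data:
        s, n = totals.get(y, (0.0, 0))
        totals[y] = (s + x, n + 1)
    return {y: s / n for y, (s, n) in totals.items()}

def make_classifier(centroids):
    """Instantiate a classifier from the rules: nearest mean wins."""
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def is_acceptable(data, min_agreement=1.0):
    """Compare the classifier's agreement with the labels to a preset condition."""
    classify = make_classifier(fit_centroids(data))
    agreement = sum(classify(x) == y for x, y in data) / len(data)
    return agreement >= min_agreement, classify

def clean(data, min_agreement=1.0, max_rounds=10):
    """Repeat determining and reconstructing until the set is acceptable."""
    data = list(data)
    for _ in range(max_rounds):
        ok, classify = is_acceptable(data, min_agreement)
        if ok:
            break
        # Reconstruct without discarding: keep every datum, update its label.
        data = [(x, classify(x)) for x, _ in data]
    return data

# Toy data: the point 2.0 is labeled "hi" although it lies near the "lo" points.
noisy = [(0.0, "lo"), (1.0, "lo"), (2.0, "hi"),
         (9.0, "hi"), (10.0, "hi"), (11.0, "hi")]
cleaned = clean(noisy)
```

With these assumptions the first round finds 5 of 6 labels in agreement, relabels the point at 2.0, and the second round accepts the modified set. The `max_rounds` cap is a safeguard added for the sketch; the claims do not mention a limit on repetitions.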
-
-
11. An article of manufacture comprising
a computer useable medium having computer readable program code means embodied therein for causing the modification of a training set for use in data classification, the computer readable program code means in said article of manufacture comprising:
computer readable program code means for causing a computer to effect determining at least one datum of said training set is incorrect;
computer readable program code means for causing a computer to effect reconstructing said at least one datum of said training set to provide a modified training set; and
wherein said computer readable program code means for causing a computer to effect reconstructing comprises computer readable program code means for causing a computer to effect modifying a label associated with said at least one datum to provide a correct label.
-
-
12. An article of manufacture comprising
a computer useable medium having computer readable program code means embodied therein for causing the modification of a training set for use in data classification, said training set comprising a plurality of n-dimensional feature vectors, each feature vector having an associated label, the computer readable program code means in said article of manufacture comprising:
computer readable program code means for causing a computer to effect determining at least one datum of said training set is incorrect, said at least one datum comprising at least one of a feature vector or its associated label; and
computer readable program code means for causing a computer to effect reconstructing without discarding said at least one datum of said training set to provide a modified training set for use in data classification.
Dependent claims: 13, 14, 15, 16, 17
computer readable program code means for causing a computer to effect dividing said plurality of data into a plurality of groups; and
computer readable program code means for causing a computer to effect applying one or more rules to at least a portion of the data of at least one group of said plurality of groups to determine if any of said corresponding labels of said at least a portion of the data is incorrect.
-
-
14. The article of manufacture of claim 13, wherein said computer readable program code means for causing a computer to effect applying comprises computer readable program code means for causing a computer to effect applying one or more rules to the data of each group of said plurality of groups to determine if any corresponding labels is incorrect.
-
15. The article of manufacture of claim 13, an article of manufacture comprising
a computer useable medium having computer readable program code means embodied therein for causing the modification of a training set for use in data classification, the computer readable program code means in said article of manufacture comprising:
computer readable program code means for causing a computer to effect determining at least one datum of said training set is incorrect;
computer readable program code means for causing a computer to effect reconstructing said at least one datum of said training set to provide a modified training set; and
wherein said training set comprises a plurality of data, each with a corresponding label, and wherein said computer readable program code means for causing a computer to effect determining comprises;
computer readable program code means for causing a computer to effect dividing said plurality of data into a plurality of groups; and
computer readable program code means for causing a computer to effect applying one or more rules to at least a portion of the data of at least one group of plurality of groups to determine if any of said corresponding labels of said at least a portion of data is incorrect; and
wherein said computer readable program code means for causing a computer to effect reconstructing comprises;
computer readable program code means for causing a computer to effect constructing a contingency table for the data of said plurality of groups;
computer readable program code means for causing a computer to effect creating a histogram from said contingency table;
computer readable program code means for causing a computer to effect identifying any regions of low confidence from said histogram;
computer readable program code means for causing a computer to effect modifying labels associated with data identified to be within a region of low confidence.
-
-
16. The article of manufacture of claim 12, further comprising computer readable program code means for causing a computer to effect determining whether said modified training set is acceptable and repeating said determining and said reconstructing when said modified training set is unacceptable.
-
17. The article of manufacture of claim 16, wherein said computer readable program code means for causing a computer to effect determining whether said modified training set is acceptable comprises:
-
computer readable program code means for causing a computer to effect creating a set of rules based on said modified training set;
computer readable program code means for causing a computer to effect using said set of rules to instantiate a classifier related to the modified training set; and
computer readable program code means for causing a computer to effect comparing results of the instantiation with one or more predetermined conditions to determine if the modified training set is acceptable.
-
-
18. A system of modifying a training set for use in data classification, said system comprising:
-
means for determining at least one datum of said training set is incorrect;
a reconstruction unit adapted to reconstruct said at least one datum of said training set to provide a modified training set;
wherein said reconstruction unit is further adapted to modify a label associated with said at least one datum to provide a correct label.
-
-
19. A system of modifying a training set for use in data classification, said system comprising:
-
means for determining at least one datum of said training set is incorrect;
a reconstruction unit adapted to reconstruct said at least one datum of said training set to provide a modified training set;
wherein said training set comprises a plurality of data, each with a corresponding label, and wherein said means for determining comprises;
means for dividing said plurality of data into a plurality of groups; and
means for applying one or more rules to at least a portion of the data of at least one group of said plurality of groups to determine if any of said corresponding labels of said at least a portion of the data is incorrect; and
wherein said means for applying comprises means for applying one or more rules to the data of each group of said plurality of groups to determine if any corresponding labels is incorrect.
-
-
20. A system of modifying a training set for use in data classification, said system comprising:
-
means for determining at least one datum of said training set is incorrect;
a reconstruction unit adapted to reconstruct said at least one datum of said training set to provide a modified training set;
wherein said training set comprises a plurality of data, each with a corresponding label, and wherein said means for determining comprises;
means for dividing said plurality of data into a plurality of groups; and
means for applying one or more rules to at least a portion of the data of at least one group of said plurality of groups to determine if any of said corresponding labels of said at least a portion of the data is incorrect; and
wherein said reconstruction unit comprises;
means for constructing a contingency table for the data of said plurality of groups;
means for creating a histogram from said contingency table;
means for identifying any regions of low confidence from said histogram; and
means for modifying labels associated with data identified to be within a region of low confidence.
-
-
21. A system of modifying a training set for use in data classification, said training set comprising a plurality of n-dimensional feature vectors, each feature vector having an associated label, said system comprising:
-
means for determining at least one datum of said training set is incorrect, said at least one datum comprising at least one of a feature vector or its associated label; and
a reconstruction unit adapted to reconstruct without discarding said at least one datum of said training set to provide a modified training set for use in data classification.
Dependent claims: 22, 23, 24, 25
means for dividing said plurality of data into a plurality of groups; and
means for applying one or more rules to at least a portion of the data of at least one group of said plurality of groups to determine if any of said corresponding labels of said at least a portion of the data is incorrect.
-
-
24. The system of claim 21, further comprising means for determining whether said modified training set is acceptable and means for repeating said determining and said reconstructing when said modified training set is unacceptable.
-
25. The system of claim 24, wherein said means for determining whether said modified training set is acceptable comprises:
-
means for creating a set of rules based on said modified training set;
means for using said set of rules to instantiate a classifier related to the modified training set; and
means for comparing results of the instantiation with one or more predetermined conditions to determine if the modified training set is acceptable.
-
Specification