Method and apparatus for mapping components of descriptor vectors to a space that discriminates between groups
First Claim
1. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for transforming components of descriptor vectors that characterize items, wherein said descriptor vectors are classified into groups, said method steps comprising:
generating first data representing differences between said groups of descriptor vectors;
generating second data representing variation within said groups of said descriptor vectors;
identifying a set of component vectors that maximizes an F distributed criterion function, said criterion function having a numerator based upon said first data and a denominator based upon said second data;
generating an F distributed statistic for subsets of said component vectors, said statistic having a numerator based upon said first data and a denominator based upon said second data;
for each particular subset of component vectors, calculating a probability value for the F-distributed statistic associated with the particular subset;
selecting a probability value from probability values for said subsets of component vectors based upon a predetermined criterion;
identifying the subset of said component vectors associated with the selected probability value; and
for at least one descriptor vector for the items, mapping said at least one descriptor vector to a space corresponding to the subset of component vectors associated with the selected probability value for subsequent processing.
Abstract
The method of the present invention transforms descriptor vectors that characterize items partitioned into groups into a space that discriminates between those groups in a well-defined optimal sense. First data is generated that represents differences between the groups of descriptor vectors. Second data is generated that represents variation within the groups of descriptor vectors. A set of component vectors is then identified that maximizes an F-distributed criterion function, which measures differences of descriptor vectors between groups relative to variation of descriptor vectors within groups. A statistic is generated for subsets of the component vectors. For each particular subset of component vectors, a probability value for the statistic associated with that subset is calculated. The subset with the minimum probability value is selected. Finally, one or more of the descriptor vectors for the items are mapped to a space corresponding to the selected subset of component vectors.
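The first two steps of the abstract, and the ratio they feed, can be illustrated with a minimal sketch. It assumes that the "first data" is a between-group scatter matrix, that the "second data" is a within-group scatter matrix, and that the criterion for a direction w is the usual one-way ANOVA F ratio with constant C = (Σ n_i − N) / (N − 1). None of these forms is reproduced in the text above, and the function names are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def add_outer(acc, v, scale=1.0):
    """Add scale * (v v^T) into the square matrix acc in place."""
    for i, vi in enumerate(v):
        for j, vj in enumerate(v):
            acc[i][j] += scale * vi * vj


def scatter_matrices(groups):
    """Return (between, within) scatter matrices for groups of descriptor vectors."""
    dim = len(groups[0][0])
    grand = mean([x for g in groups for x in g])
    between = [[0.0] * dim for _ in range(dim)]
    within = [[0.0] * dim for _ in range(dim)]
    for g in groups:
        mu = mean(g)
        # Between-group term: group mean vs. grand mean, weighted by group size.
        add_outer(between, [a - b for a, b in zip(mu, grand)], scale=len(g))
        # Within-group term: each item vs. its own group mean.
        for x in g:
            add_outer(within, [a - b for a, b in zip(x, mu)])
    return between, within


def quad(w, m):
    """Quadratic form w^T m w."""
    n = len(w)
    return sum(w[i] * m[i][j] * w[j] for i in range(n) for j in range(n))


def criterion(w, groups):
    """Assumed F-ratio criterion for direction w (not the patent's stated formula)."""
    between, within = scatter_matrices(groups)
    n_groups = len(groups)
    n_items = sum(len(g) for g in groups)
    c = (n_items - n_groups) / (n_groups - 1)  # assumed degrees-of-freedom constant
    return c * quad(w, between) / quad(w, within)


groups = [
    [[0.0, 0.0], [2.0, 1.0]],
    [[10.0, 0.0], [12.0, 1.0]],
]
print(criterion([1.0, 0.0], groups))  # direction that separates the groups
print(criterion([0.0, 1.0], groups))  # direction that does not
```

In this toy data the two groups differ only along the first axis, so the criterion is large for w = (1, 0) and zero for w = (0, 1).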
44 Citations
26 Claims
-
1. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform method steps for transforming components of descriptor vectors that characterize items, wherein said descriptor vectors are classified into groups, said method steps comprising:
-
generating first data representing differences between said groups of descriptor vectors;
generating second data representing variation within said groups of said descriptor vectors;
identifying a set of component vectors that maximizes an F distributed criterion function, said criterion function having a numerator based upon said first data and a denominator based upon said second data;
generating an F distributed statistic for subsets of said component vectors, said statistic having a numerator based upon said first data and a denominator based upon said second data;
for each particular subset of component vectors, calculating a probability value for the F-distributed statistic associated with the particular subset;
selecting a probability value from probability values for said subsets of component vectors based upon a predetermined criterion;
identifying the subset of said component vectors associated with the selected probability value; and
for at least one descriptor vector for the items, mapping said at least one descriptor vector to a space corresponding to the subset of component vectors associated with the selected probability value for subsequent processing.
where ŵ is some vector, and C is a constant based upon degrees of freedom in ε_b and ε_w.
-
-
4. The program storage device of claim 3, wherein C is determined as follows:
-
where N represents the number of groups of items, n_i represents the number of items in a group, and Σ n_i represents the sum of n_i for the N groups.
-
-
5. The program storage device of claim 3, wherein said statistic for a given subset of component vectors is based upon value of said criterion function for said subset of component vectors.
-
6. The program storage device of claim 5, wherein said statistic for a given subset of component vectors has the following form:
-
7. The program storage device of claim 6, wherein said probability value for a particular F-distributed statistic represents a probability value that the particular F-distributed statistic could have been larger by chance.
-
8. The program storage device of claim 7, wherein said probability value selected from probability values for said subsets of component vectors is a minimum probability value of said probability values for said subsets of component vectors.
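Claims 7 and 8 can be read as computing an upper-tail F probability for each candidate subset and keeping the subset whose probability is smallest. The sketch below follows that reading using only the Python standard library. The candidate labels, statistic values, and degrees of freedom are hypothetical, since the form of the statistic in claim 6 is not reproduced in the text above. The upper tail is evaluated through the regularized incomplete beta function with a standard continued-fraction expansion.

```python
import math


def _beta_cf(a, b, x, max_iter=200, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_upper_tail(f_value, dfn, dfd):
    """P(F > f_value) for an F distribution with (dfn, dfd) degrees of freedom."""
    if f_value <= 0.0:
        return 1.0
    x = dfd / (dfd + dfn * f_value)
    return reg_inc_beta(dfd / 2.0, dfn / 2.0, x)


def select_subset(candidates):
    """Return (p_value, label) of the candidate with the minimum p-value.

    candidates: list of (label, f_value, dfn, dfd); all values are hypothetical.
    """
    return min((f_upper_tail(f, dfn, dfd), label) for label, f, dfn, dfd in candidates)


candidates = [("first 1 component", 3.0, 2, 2), ("first 2 components", 1.0, 4, 2)]
p_value, chosen = select_subset(candidates)
print(chosen, p_value)
```

Selecting the minimum corresponds to claim 8; a different "predetermined criterion" (claim 1) would only change `select_subset`.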
-
9. The program storage device of claim 2, wherein the step of identifying a set of component vectors that maximizes an F distributed criterion function comprises the substeps of:
-
determining a set of (eigenvalue, eigenvector) pairs for the matrix ε_w; and
determining said set of component vectors based upon said set of (eigenvalue, eigenvector) pairs for the matrix ε_w.
-
-
10. The program storage device of claim 1, wherein the mapping step for said descriptor vector performs a loop over each component vector belonging to the subset of component vectors associated with the selected probability value; wherein, in each iteration of said loop, a dot product of said descriptor vector with a transpose of a unit vector for the given component vector is added to a running sum.
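The loop described in claim 10 can be sketched as follows. The claim does not say exactly what the running sum holds, so this is one possible reading: each iteration normalizes the component vector, takes its dot product with the descriptor vector, records that value as a coordinate in the reduced space, and adds the dot product times the unit vector to a running vector sum. The function names are hypothetical.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        raise ValueError("component vector has zero length")
    return [a / norm for a in v]


def map_descriptor(x, components):
    """Map descriptor vector x onto the selected component vectors.

    Returns (coords, running), where coords[k] is the dot product of x with
    the k-th unit component vector, and running is the accumulated sum of
    those dot products times the corresponding unit vectors.
    """
    coords = []
    running = [0.0] * len(x)
    for w in components:  # loop over the selected subset of component vectors
        u = unit(w)
        dot = sum(a * b for a, b in zip(x, u))
        coords.append(dot)
        running = [r + dot * a for r, a in zip(running, u)]
    return coords, running


coords, running = map_descriptor([3.0, 4.0, 5.0], [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(coords)   # coordinates in the two-dimensional reduced space
print(running)  # the same information expressed in the original space
```

If the component vectors are not mutually orthogonal, `running` is simply the accumulated sum and should not be read as an orthogonal projection.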
-
11. The program storage device of claim 1, wherein said items comprise genotypes partitioned into groups based upon phenotypes exhibited by said genotypes, and wherein descriptor vectors associated with said genotypes represent one of biological, chemical, and physical properties of said genotypes.
-
12. The program storage device of claim 1, wherein said items comprise individuals partitioned into groups based upon one of characteristics of said individuals and categories of auto insurance policies, and wherein descriptor vectors associated with said individuals represent risk of said individuals.
-
13. The program storage device of claim 1, wherein said items comprise plant species partitioned into groups based upon one of characteristics of said plant species and categories of treatments applied to said plant species, and wherein descriptor vectors associated with said plant species represent a characteristic of said plant species.
-
14. A computer-implemented method for transforming descriptor vectors that characterize items, wherein said descriptor vectors are classified into groups, said method comprising the steps of:
-
generating first data representing differences between said groups of descriptor vectors;
generating second data representing variation within said groups of descriptor vectors;
identifying a set of component vectors that maximizes an F distributed criterion function, said criterion function having a numerator based upon said first data and a denominator based upon said second data;
generating an F distributed statistic for subsets of said component vectors, said statistic having a numerator based upon said first data and a denominator based upon said second data;
for each particular subset of component vectors, calculating a probability value for the F-distributed statistic associated with the particular subset;
selecting a probability value from probability values for said subsets of component vectors based upon a predetermined criterion;
identifying the subset of said component vectors associated with the selected probability value; and
for at least one descriptor vector for the items, mapping said at least one descriptor vector to a space corresponding to the subset of component vectors associated with the selected probability value for subsequent processing.
where ŵ is some vector, and C is a constant based upon degrees of freedom in ε_b and ε_w.
-
-
17. The method of claim 16, wherein C is determined as follows:
-
where N represents the number of groups of items, n_i represents the number of items in a group, and Σ n_i represents the sum of n_i for the N groups.
-
-
18. The method of claim 16, wherein said statistic for a given subset of component vectors is based upon value of said criterion function for said subset of component vectors.
-
19. The method of claim 18, wherein said statistic for a given subset of component vectors has the following form:
-
20. The method of claim 19, wherein said probability value for a particular F-distributed statistic represents a probability value that the particular F-distributed statistic could have been larger by chance.
-
21. The method of claim 20, wherein said probability value selected from probability values for said subsets of component vectors is a minimum probability value of said probability values for said subsets of component vectors.
-
22. The method of claim 15, wherein the step of identifying a set of component vectors that maximizes an F distributed criterion function comprises the substeps of:
-
determining a set of (eigenvalue, eigenvector) pairs for the matrix ε_w; and
determining said set of component vectors based upon said set of (eigenvalue, eigenvector) pairs for the matrix ε_w.
-
-
23. The method of claim 14, wherein the mapping step for said descriptor vector performs a loop over each component vector belonging to the subset of component vectors associated with the selected probability value; wherein, in each iteration of said loop, a dot product of said descriptor vector with a transpose of a unit vector for the given component vector is added to a running sum.
-
24. The method of claim 14, wherein said items comprise genotypes partitioned into groups based upon phenotypes exhibited by said genotypes, and wherein descriptor vectors associated with said genotypes represent one of biological, chemical, and physical properties of said genotypes.
-
25. The method of claim 14, wherein said items comprise individuals partitioned into groups based upon one of characteristics of said individuals and categories of auto insurance policies, and wherein descriptor vectors associated with said individuals represent risk of said individuals.
-
26. The method of claim 14, wherein said items comprise plant species partitioned into groups based upon one of characteristics of said plant species and categories of treatments applied to said plant species, and wherein descriptor vectors associated with said plant species represent a characteristic of said plant species.
Specification