Algorithm for automatic selection of discriminant term combinations for document categorization
Abstract
A method and apparatus for document categorization are described. In one embodiment, the method comprises automatically selecting one or more discriminant term combinations and using the one or more discriminant term combinations for document categorization.
19 Claims
1. A method, comprising:
associating a weighted value with each term in a set of terms contained within textual content, where a first word can affect the conveyed meaning of a second word;
expressing the set of terms derived from the textual content as a Volterra series;
correlating the set of terms to a particular subject by using a vector based technique of document classification; and
determining a mathematical indication of whether the content relates to the particular subject.
Dependent claims: 2, 3, 4, 5, 6
associating a greater weight value to a first word if the first word is a statistically infrequently used word in a language rather than a statistically commonly used word in the language.
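One common way to give rare words more weight than common ones is inverse document frequency. The sketch below is illustrative only and is not taken from the patent: the function name `rarity_weights`, the smoothing formula, and the use of a small sample corpus as a stand-in for "a language" are all assumptions.

```python
import math
from collections import Counter

def rarity_weights(documents):
    """Weight each term by a smoothed inverse document frequency.

    Terms that appear in fewer documents receive a larger weight.
    (Hypothetical helper; the corpus stands in for language-wide statistics.)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: log((1 + N) / (1 + df)) + 1, which is always positive.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }

corpus = [
    "the engine stalled on the track",
    "the driver praised the engine",
    "the weather delayed the race",
]
weights = rarity_weights(corpus)
```

Here `weights["the"]` is the smallest value because "the" occurs in every document, while a word that occurs in only one document, such as "stalled", gets the largest.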
4. The method of claim 1, wherein the set of terms includes higher order terms.
5. The method of claim 1, further comprising:
determining a first set of values associated with the matrix of data by using a second algorithm, wherein the set of weighted values includes three or more values; and
correlating the first set of values associated with the matrix of data to a category nearest in value to the first set of values associated with the matrix of data.
6. The method of claim 1, further comprising:
modeling effects of terms as a non-linear function when a first term affects the meaning of a second term.
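To make the idea of a non-linear, Volterra-style expression of terms concrete, the following sketch scores a document with a constant, first-order (single-term) contributions, and second-order (term-pair) contributions. It is one possible reading, not the patented implementation: the names `second_order_score`, `h0`, `h1`, and `h2` are assumptions, the series is truncated at second order, and self-interaction terms are omitted for brevity.

```python
from itertools import combinations

def second_order_score(x, h0, h1, h2):
    """Truncated Volterra-style expansion over weighted terms.

    x  : dict mapping term -> weighted value for one document
    h0 : constant (zeroth-order) coefficient
    h1 : dict mapping term -> first-order coefficient
    h2 : dict mapping (term_a, term_b) -> second-order coefficient,
         keyed with the two terms in sorted order
    """
    score = h0
    # First-order part: each term contributes independently.
    score += sum(h1.get(term, 0.0) * value for term, value in x.items())
    # Second-order part: a pair contributes only when both terms are present,
    # which lets one word change the effect of another.
    for a, b in combinations(sorted(x), 2):
        score += h2.get((a, b), 0.0) * x[a] * x[b]
    return score

h1 = {"hot": 0.1, "dog": 0.2}
h2 = {("dog", "hot"): 1.5}
both = second_order_score({"hot": 1.0, "dog": 1.0}, 0.0, h1, h2)
alone = second_order_score({"hot": 1.0}, 0.0, h1, h2)
```

With these made-up coefficients, "hot" alone scores 0.1, but "hot" together with "dog" scores about 1.8, because the pair coefficient dominates the two single-term contributions.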
7. A method, comprising:
eliminating terms mathematically not useful in solving a solution in a corpus of terms contained in a body of text by using a first algorithm to generate a set of terms, wherein the algorithm assigns a zero weight value to the terms in the body of the text found not useful in mathematically solving the solution;
generating a matrix of data to represent the set of terms;
determining a first value associated with the matrix of data by using a second algorithm;
correlating the first value associated with the matrix of data to a category nearest in value to the first value associated with the matrix of data; and
determining a mathematical indication of whether the content relates to a particular subject.
Dependent claims: 8, 9, 10, 11
comparing words in the content to common grammatical terms contained in a list, and removing from the corpus of terms any word that matches one or more of the common grammatical terms contained in the list.
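Comparing words against a list of common grammatical terms is essentially stop-word filtering. A minimal sketch follows; the list contents and the name `remove_common_terms` are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
# Hypothetical, deliberately short list of common grammatical terms.
COMMON_GRAMMATICAL_TERMS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def remove_common_terms(text, stop_list=COMMON_GRAMMATICAL_TERMS):
    """Return the words of `text` that do not match any term in the list."""
    return [word for word in text.lower().split() if word not in stop_list]

remaining = remove_common_terms("The price of oil is rising in Europe")
```

For this sentence, `remaining` holds the four content words "price", "oil", "rising", and "europe".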
9. The method of claim 7, further comprising:
limiting the incorporation of higher order terms into the set of terms to only those higher order terms that have classifiers that contribute to a statistically probable result.
10. The method of claim 7, further comprising:
modeling effects of terms as a non-linear function when a first term affects the meaning of a second term.
11. The method of claim 7, further comprising:
expressing the set of terms as a Volterra series.
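The zero-weighting, value, and nearest-category steps recited in claim 7 can be sketched as follows. This is an assumption-laden illustration rather than the claimed algorithms: a dictionary of term counts stands in for the "matrix of data", a weighted sum stands in for the "second algorithm", and the names `document_value`, `nearest_category`, and the per-category reference values are invented for the example.

```python
def document_value(term_counts, weights):
    """Collapse a document's term counts into a single value.

    Terms with weight zero (or no weight at all) contribute nothing,
    mirroring the idea of zero-weighting terms that are not useful.
    """
    return sum(weights.get(term, 0.0) * count for term, count in term_counts.items())

def nearest_category(value, category_values):
    """Pick the category whose reference value is closest to `value`."""
    return min(category_values, key=lambda name: abs(category_values[name] - value))

# Made-up weights; "the" is zero-weighted as a non-useful term.
weights = {"goal": 1.0, "match": 0.5, "election": -1.0, "the": 0.0}
counts = {"goal": 2, "match": 1, "the": 5}

value = document_value(counts, weights)
category_values = {"sport": 2.0, "politics": -2.0}
label = nearest_category(value, category_values)
```

The document value here is 2.5, which lies closer to the "sport" reference value than to the "politics" one, so `label` is "sport". The five occurrences of "the" have no effect on the result.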
12. An apparatus, comprising:
a software engine containing a plurality of modules;
a first module to identify and remove terms mathematically not useful in solving a solution in a corpus of terms contained in a body of text by using a first algorithm to generate a set of terms, wherein the algorithm assigns a zero weight value to the terms in the body of the text found not useful in mathematically solving the solution;
a second module to generate a matrix of data to represent the set of terms;
a third module to use an algorithm in order to determine a first value associated with the matrix of data;
a fourth module to correlate the first value assigned to the matrix of data to a category nearest in value to the first value associated with the matrix of data; and
a fifth module to determine a mathematical indication of whether the content relates to a particular subject.
Dependent claims: 13, 14, 15, 16
a sixth module to limit the incorporation of higher order terms into the set of terms to only those higher order terms that have classifiers that contribute to a statistically probable result.
14. The apparatus of claim 13, wherein the first value associated with the matrix of data includes an extended space value associated with the higher order terms that have classifiers that contribute to a statistically probable result.
15. The apparatus of claim 13, further comprising:
a sixth module to model effects of terms as a non-linear function when a first term affects the meaning of a second term.
16. The apparatus of claim 13, further comprising:
a sixth module to express the set of terms as a Volterra series.
17. A method, comprising:
associating a weighted value with each term in a set of terms contained within content, where a first word can affect the conveyed meaning of a second word;
expressing the set of terms as a Volterra series, wherein the set of terms comprises words from the content selected based upon frequency of use of the words in the content, each of the words having a weighted value associated therewith;
correlating the set of terms to a particular subject by using a vector based technique of document classification;
determining a probability indication of whether the content relates to the particular subject; and
associating a greater weight value to a statistically infrequently used word in a language rather than a statistically commonly used word in the language.
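One way to turn a vector comparison into a probability indication is to take the dot product of the document's weighted term vector with a subject vector and pass it through a logistic function. The claim does not name a specific technique, so everything in this sketch is an assumption: the function name `subject_probability`, the subject vector, and the choice of the logistic function.

```python
import math

def subject_probability(doc_vector, subject_vector, bias=0.0):
    """Map a linear score (dot product plus bias) to a value between 0 and 1."""
    score = bias + sum(
        value * subject_vector.get(term, 0.0) for term, value in doc_vector.items()
    )
    # Logistic function: large positive scores approach 1, large negative approach 0.
    return 1.0 / (1.0 + math.exp(-score))

# Made-up subject vector for a "motor racing" subject.
racing = {"engine": 1.0, "race": 1.0, "ballot": -2.0}

p_related = subject_probability({"engine": 1.7, "race": 1.3}, racing)
p_unrelated = subject_probability({"ballot": 1.5}, racing)
p_unknown = subject_probability({}, racing)
```

A document with no informative terms scores exactly 0.5 under this scheme, which reads naturally as "no evidence either way".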
18. An apparatus, comprising:
a software engine containing a plurality of modules;
a first module to identify and remove redundant terms in a corpus of terms contained within content in order to create a set of terms;
a second module to generate a matrix of data to represent the set of terms;
a third module to use an algorithm in order to determine a first value associated with the matrix of data;
a fourth module to correlate the first value assigned to the matrix of data to a category nearest in value to the first value associated with the matrix of data;
a fifth module to determine a probability of whether the content relates to a particular subject; and
a sixth module to limit the incorporation of higher order terms into the set of terms to only those higher order terms that have classifiers that contribute to a statistically probable result.
19. A method comprising:
eliminating redundant terms in a corpus of terms contained within content by using a first algorithm to generate a set of terms;
limiting the incorporation of higher order terms into the set of terms to only those higher order terms that have classifiers that contribute to a statistically probable result;
generating a matrix of data to represent the set of terms;
determining a first value associated with the matrix of data by using a second algorithm;
correlating the first value associated with the matrix of data to a category nearest in value to the first value associated with the matrix of data; and
determining a probability indication of whether the content relates to a particular subject.
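Limiting higher order terms to those that actually help classification could look something like the sketch below, which keeps only word pairs whose document rate differs sharply between a target category and everything else. The selection criterion, the threshold `min_gap`, and the name `select_pair_terms` are assumptions made for illustration; the claim itself does not specify how contribution is measured.

```python
from collections import Counter
from itertools import combinations

def select_pair_terms(labeled_docs, target, min_gap=0.75):
    """Keep word pairs that discriminate the target category.

    A pair is kept when the fraction of target documents containing both words
    differs from the fraction of other documents containing both words
    by at least `min_gap`.
    """
    in_counts, out_counts = Counter(), Counter()
    n_in = n_out = 0
    for text, label in labeled_docs:
        words = sorted(set(text.lower().split()))
        pairs = set(combinations(words, 2))
        if label == target:
            in_counts.update(pairs)
            n_in += 1
        else:
            out_counts.update(pairs)
            n_out += 1
    selected = []
    for pair in set(in_counts) | set(out_counts):
        gap = in_counts[pair] / max(n_in, 1) - out_counts[pair] / max(n_out, 1)
        if abs(gap) >= min_gap:
            selected.append(pair)
    return sorted(selected)

docs = [
    ("hot dog stand", "food"),
    ("hot dog bun", "food"),
    ("hot weather", "other"),
    ("dog park", "other"),
]
pairs = select_pair_terms(docs, "food")
```

In this toy corpus only the pair ("dog", "hot") survives: it appears in every "food" document and in no other document, whereas pairs such as ("dog", "stand") appear in only half of the "food" documents and fall below the threshold.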
Specification