Text-classification system and method
First Claim
1. A computer-executed method for classifying a target document in the form of a digitally encoded natural-language text into one or more of two or more different classes, comprising the steps of:
- (a) for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups in the target document, selecting a term as a descriptive term if the term has an above-threshold selectivity value in at least one library of texts in a field, where the selectivity value of the term in the library of texts in the field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively,
- (b) determining, for each of a plurality of sample texts, a match score related to the number of descriptive terms present in or derived from the text that match those in the target document, where each of the plurality of sample texts has an associated classification identifier that identifies the one or more different classes to which the text belongs,
- (c) selecting one or more of the sample texts having the highest match scores,
- (d) recording the one or more classification identifiers associated with the one or more sample texts having the highest match scores, and
- (e) associating the one or more classification identifiers from step (d) with the target document, thereby to classify the target document as belonging to one or more classes represented by at least one of the classification identifiers from step (d).
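Steps (a) through (e) of the claim can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the stop list `GENERIC_WORDS`, the `threshold` and `top_n` parameters, the shape of the `selectivity` mapping, and the use of a simple count of shared terms as the match score. Word groups are omitted, and the claim does not prescribe any of these choices.

```python
from collections import Counter

# Hypothetical stop list standing in for "generic words".
GENERIC_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for", "with"}


def extract_terms(text):
    """Return the set of non-generic single-word terms in a text."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in words if w and w not in GENERIC_WORDS}


def classify(target_text, sample_texts, selectivity, threshold=1.5, top_n=2):
    """sample_texts: list of (text, class_ids) pairs.
    selectivity: term -> {library_name: selectivity_value} (assumed precomputed).
    """
    # (a) keep terms whose selectivity exceeds the threshold in any library
    descriptive = {
        t for t in extract_terms(target_text)
        if any(v > threshold for v in selectivity.get(t, {}).values())
    }
    # (b) match score = number of descriptive terms shared with each sample
    scored = [
        (len(descriptive & extract_terms(text)), class_ids)
        for text, class_ids in sample_texts
    ]
    # (c) select the highest-scoring sample texts
    best = sorted(scored, key=lambda s: s[0], reverse=True)[:top_n]
    # (d) record their classification identifiers (ignoring zero-score texts)
    votes = Counter(c for score, class_ids in best if score > 0 for c in class_ids)
    # (e) associate the recorded identifiers with the target document
    return [c for c, _ in votes.most_common()]
```

Calling `classify` with a target text, a handful of labelled sample texts, and a selectivity table returns the class identifiers of the best-matching samples. Ranking identifiers by how often they occur among the top matches is one possible way to go from step (d) to step (e), not the only one.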
Abstract
Disclosed are computer-readable code, a system, and a method for classifying a target document in the form of a digitally encoded natural-language text as belonging to one or more of two or more different classes. Each of a plurality of non-generic words and, optionally, word groups characterizing the target document is selected as a descriptive term if the term has an above-threshold selectivity value in at least one library of texts in a field, where the selectivity value of a term is a measure of the field-specificity of that term. A match score is then determined for each of a plurality of sample texts having associated classification identifiers; the score is related to the number of descriptive terms present in or derived from that text that match those in the target text. From the selected matched texts and their associated classification identifiers, a classification of the target document is determined.
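The abstract describes the selectivity value only as a measure of field-specificity, and the claims relate it to a term's frequency in one library relative to its frequency in other libraries. One plausible reading, offered here purely as an assumption, is a ratio of document fractions: the share of texts containing the term in one field divided by the share in all other fields combined. The function name `selectivity_values` and the input layout are hypothetical.

```python
def selectivity_values(term, libraries):
    """libraries: field name -> list of texts, each text given as a set of terms.

    Returns {field: selectivity}. The formula (fraction of texts containing the
    term in this field divided by the same fraction over all other fields) is
    an assumed illustration, not a formula stated in this section.
    """
    def doc_fraction(texts):
        return sum(term in t for t in texts) / len(texts) if texts else 0.0

    values = {}
    for field, texts in libraries.items():
        others = [t for f, ts in libraries.items() if f != field for t in ts]
        own, rest = doc_fraction(texts), doc_fraction(others)
        if own == 0.0:
            values[field] = 0.0
        elif rest == 0.0:
            values[field] = float("inf")  # term occurs only in this field
        else:
            values[field] = own / rest
    return values
```

Under this reading, a value well above 1 marks a term as characteristic of a field, a value near 1 marks it as field-neutral, and the threshold in the selection step decides how far above 1 a term must be to count as descriptive.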
26 Claims
16. An automated system for classifying a target document in the form of a digitally encoded text as belonging to one or more of a plurality of different classes, comprising:
(1) a computer,
(2) accessible by said computer, a database of word records, where each record includes text identifiers of the library texts that contain the word, associated library and classification identifiers for each text, and, optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, and
(3) a computer-readable code which is operable, under the control of said computer, to perform the steps of:
(a) for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups characterizing the target document, selecting the term as a descriptive term if the term has an above-threshold selectivity value in at least one library of texts in a field, by (i) accessing said database and (ii) calculating or recording from the database the selectivity value associated with the term,
(b) determining, for each of the plurality of library texts, a match score related to the number of descriptive terms present in or derived from the text that match those in the target document,
(c) selecting one or more of the library texts having the highest match scores,
(d) recording the one or more classification identifiers associated with the one or more library texts having the highest match scores, and
(e) associating the one or more classification identifiers from step (d) with the target document, thereby to classify the target document as belonging to at least one class represented by the classification identifiers from step (d).
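The database of word records in element (2) can be pictured as an inverted index keyed by word. The sketch below is one possible in-memory layout under assumed names (`WordRecord`, `build_word_records`, and the tuple format of the input); the claim does not specify a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class WordRecord:
    # text_id -> (library_id, class_ids) for every library text containing the word
    texts: dict = field(default_factory=dict)
    # Optional precomputed selectivity values: library_id -> value
    selectivity: dict = field(default_factory=dict)


def build_word_records(library_texts):
    """library_texts: iterable of (text_id, library_id, class_ids, words)."""
    records = {}
    for text_id, library_id, class_ids, words in library_texts:
        for word in set(words):  # one entry per text, however often the word repeats
            record = records.setdefault(word, WordRecord())
            record.texts[text_id] = (library_id, tuple(class_ids))
    return records
```

With records in this shape, step (a)(i) is a dictionary lookup on the word, and step (a)(ii) either reads `selectivity` when it has been stored or computes it from the per-library text counts available in `texts`.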
18. The system of claim 16, wherein said code is operable, in carrying out the step of determining match scores, to (i) access the database to identify library texts associated with each descriptive word in the target text, and (ii) from the texts identified in step (i), determine a text match score based on the number of descriptive words in a text, weighted by the selectivity values of the matching words.
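A selectivity-weighted match score of the kind recited in claim 18 might be accumulated as follows. This is a sketch under stated assumptions: `index` is a hypothetical word-to-text-identifiers mapping, and each descriptive word carries a single weight (for example, its highest selectivity value across libraries), which the claim itself does not specify.

```python
from collections import defaultdict


def weighted_match_scores(descriptive_terms, index):
    """descriptive_terms: term -> selectivity value used as the weight.
    index: term -> set of library text ids containing that term.
    """
    scores = defaultdict(float)
    for term, weight in descriptive_terms.items():
        # (i) look up the library texts containing the descriptive word
        for text_id in index.get(term, ()):
            # (ii) add the word's selectivity value to that text's score
            scores[text_id] += weight
    return dict(scores)
```

Because only texts that share at least one descriptive word ever receive a score, the loop touches a small part of the library rather than every text, which is the practical benefit of going through the word records.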
22. Computer-readable code for use with an electronic computer and a database of word records in classifying a target document in the form of a digitally encoded text as belonging to one or more of a plurality of different classes, where each record in the word records database includes text identifiers of the library texts that contain the word, an associated library identifier for each text, an associated classification identifier for each text, and, optionally, one or more selectivity values for each word, where the selectivity value of a term in a library of texts in a field is related to the frequency of occurrence of the term in said library, relative to the frequency of occurrence of the same term in one or more other libraries of texts in one or more other fields, respectively, said code being operable, under the control of said computer, to perform the steps of:
(a) for each of a plurality of terms composed of non-generic words and, optionally, proximately arranged word groups characterizing the target document, selecting the term as a descriptive term if the term has an above-threshold selectivity value in at least one library of texts in the field, by (i) accessing said database and (ii) calculating or recording from the database the selectivity value associated with the term,
(b) determining, for each of the plurality of library texts, a match score related to the number of descriptive terms present in or derived from the text that match those in the target document,
(c) selecting one or more of the library texts having the highest match scores,
(d) recording the one or more classification identifiers associated with the one or more library texts having the highest match scores, and
(e) associating the one or more classification identifiers from step (d) with the target document, thereby to classify the target document as belonging to at least one class represented by the classification identifiers from step (d).
Specification